1  Introduction


Let $\pi$ be a probability distribution on a measurable space $(\mathcal{X}, \mathcal{B})$, and suppose that we are
interested in estimating characteristics of it such as $\pi(E)$ or $\int f\,d\pi$, where $E \in \mathcal{B}$ and $f$ is
a bounded measurable function. Even when $\pi$ is fully specified, one may have to resort to
Monte Carlo simulation, especially when $\pi$ is not computationally tractable. For this one
can draw on the extensive literature on generating random variables from an explicitly or
implicitly described probability distribution $\pi$. Generally these methods require $\mathcal{X}$ to be
the real line or require that $\pi$ have special features, such as a structure in terms of
independent real-valued random variables. When one cannot generate random variables with
distribution $\pi$, one has to be satisfied with looking for a sequence of random variables
$X_1, X_2, \ldots$ whose distributions converge to $\pi$ and using $X_n$ with a large index $n$
as an observation from $\pi$. An example is the classical Markov chain simulation method,
which can be described as follows.

Let $P(x, A)$ be a transition probability function with the property that it has stationary
distribution $\pi$, i.e.

$$\pi(C) = \int P(x, C)\,\pi(dx) \quad \text{for all } C \in \mathcal{B}. \tag{1.1}$$

We fix a starting point $x_0$, generate an observation $X_1$ from $P(x_0, \cdot)$, generate an observation
$X_2$ from $P(X_1, \cdot)$, etc. This generates the Markov chain $x_0 = X_0, X_1, X_2, \ldots$. Let $P^n(x, \cdot)$
denote the distribution of $X_n$ when the chain is started at $x$. If we can show that

$$\sup_{C \in \mathcal{B}} |P^n(x, C) - \pi(C)| \to 0 \quad \text{for all } x \in \mathcal{X},$$


then by running the chain sufficiently long, we succeed in generating an observation $X_n$
with distribution approximately $\pi$. Then, we may estimate $\pi$, for example, by generating $G$
such chains in parallel, obtaining independent observations $X_n^{(1)}, \ldots, X_n^{(G)}$, or by running
one (or a few) very long chains. In Section 3 we make some remarks on the advantages and
disadvantages of these two methods.

The Metropolis algorithm and its variants produce Markov transition functions satisfying
(1.1). This algorithm was originally developed for estimating certain distributions and
expectations  arising  in  statistical  physics,  but  can  also  be  used  in  Bayesian  analysis;  see 
Tierney  (1991)  for  a  review. 

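However, in the usual problems of Bayesian statistics, the most commonly used Markov
chain is one that is used to estimate the unknown joint distribution $\pi = \pi_{X^{(1)}, \ldots, X^{(p)}}$ of
the (possibly vector-valued) random variables $(X^{(1)}, \ldots, X^{(p)})$ by updating the coordinates
one at a time, as follows. We suppose that we know the conditional distributions
$\pi_{X^{(i)} \mid \{X^{(j)},\, j \neq i\}}$, $i = 1, \ldots, p$, or at least that we are able to generate observations from these
conditional distributions. If $X_m = (X_m^{(1)}, \ldots, X_m^{(p)})$ is the current state, the next state
$X_{m+1} = (X_{m+1}^{(1)}, \ldots, X_{m+1}^{(p)})$ of the Markov chain is formed as follows. Generate $X_{m+1}^{(1)}$ from
$\pi_{X^{(1)} \mid \{X^{(j)},\, j \neq 1\}}(\cdot \mid X_m^{(2)}, \ldots, X_m^{(p)})$, then $X_{m+1}^{(2)}$ from
$\pi_{X^{(2)} \mid \{X^{(j)},\, j \neq 2\}}(\cdot \mid X_{m+1}^{(1)}, X_m^{(3)}, \ldots, X_m^{(p)})$,
and so on until $X_{m+1}^{(p)}$ is generated from $\pi_{X^{(p)} \mid \{X^{(j)},\, j \neq p\}}(\cdot \mid X_{m+1}^{(1)}, \ldots, X_{m+1}^{(p-1)})$. If $P$ is the
transition function that produces $X_{m+1}$ from $X_m$, then it is easy to see that $P$ satisfies (1.1).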
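
The coordinate-by-coordinate update just described can be sketched in a few lines of Python. The target below is an assumption made purely for illustration and does not come from the text: a bivariate normal with zero means, unit variances and correlation `rho`, for which each full conditional is normal with mean `rho` times the other coordinate and variance `1 - rho**2`. The function names are hypothetical.

```python
import math
import random


def successive_substitution(rho, n_iter, seed=0):
    """Run the two-coordinate sampler for a standard bivariate normal target.

    Illustrative assumption: (X1, X2) has zero means, unit variances and
    correlation rho, so each full conditional is N(rho * other, 1 - rho**2).
    """
    rng = random.Random(seed)
    sd = math.sqrt(1.0 - rho * rho)
    x1, x2 = 0.0, 0.0  # fixed starting point x_0
    path = []
    for _ in range(n_iter):
        x1 = rng.gauss(rho * x2, sd)  # draw X1 given the current X2
        x2 = rng.gauss(rho * x1, sd)  # draw X2 given the updated X1
        path.append((x1, x2))
    return path


def sample_correlation(pairs):
    """Plain sample correlation of a list of (a, b) pairs."""
    n = len(pairs)
    m1 = sum(a for a, _ in pairs) / n
    m2 = sum(b for _, b in pairs) / n
    cov = sum((a - m1) * (b - m2) for a, b in pairs) / n
    v1 = sum((a - m1) ** 2 for a, _ in pairs) / n
    v2 = sum((b - m2) ** 2 for _, b in pairs) / n
    return cov / math.sqrt(v1 * v2)
```

After discarding an initial stretch of the path, the sample correlation of the remaining pairs should be close to `rho`, which is one informal check that the chain is visiting the intended joint distribution.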

This method is reminiscent of the simulation method described in Geman and Geman
(1984). In that paper, $p$, the number of coordinate indices in the vector $(X^{(1)}, \ldots, X^{(p)})$,
is usually of the order of $N \times N$ where $N = 256$ or higher. They assume that these indices
form a graph with a meaningful neighborhood structure and that $\pi$ is a Gibbs distribution,
so that the conditional distributions $\pi_{X^{(i)} \mid \{X^{(j)},\, j \neq i\}}$, $i = 1, \ldots, p$, depend on many fewer than
$p - 1$ coordinates. They also assume that each random variable $X^{(i)}$ takes only a finite number
$k$ of values and that $\pi$ gives positive mass to all possible $k^p$ values. Geman and Geman
(1984) appeal to the ergodic theorem on Markov chains with a finite state space and prove
that this simulation method works. They prove other interesting results on how this can
be extended when a temperature parameter $T$ (that can be incorporated into $\pi$) is allowed
to  vary.  This  may  be  the  reason  why  the  method  described  in  the  previous  paragraph 
has  come  to  be  known  as  the  Gibbs  sampler.  We  consider  this  to  be  a  misnomer,  because 
no  Gibbs  distribution  nor  any  graph  with  a  nontrivial  neighborhood  structure  supporting 
a  Gibbs  distribution  is  involved  in  this  method;  we  will  refer  to  it  simply  as  successive 
substitution  sampling. 

We note that this algorithm depends on $\pi$ only through the conditional distributions
$\pi_{X^{(i)} \mid \{X^{(j)},\, j \neq i\}}$. Perhaps the first thought that comes to mind when considering this method
is to ask whether or not, in general, these conditionals determine the joint distribution $\pi$.
The answer is that in general they do not; we give an example in Remark 3 of Section 2.2.
A necessary consequence of convergence of successive substitution sampling is that the joint
distribution is determined by the conditionals. It is therefore clear that any theorem giving
conditions guaranteeing convergence also gives, indirectly, conditions which guarantee that
the conditionals determine the joint distribution $\pi$.

We now give a very brief description of how this method is useful in some Bayesian
problems. We suppose that the parameter $\theta$ has some prior distribution, that we observe a
data point $Y$ whose conditional distribution given $\theta$ is $\mathcal{L}(Y \mid \theta)$, and that we wish to obtain
$\mathcal{L}(\theta \mid Y)$, the conditional distribution of $\theta$ given $Y$. It is often the case that if we consider
an (unobservable) auxiliary random variable $Z$, then the distribution $\pi_{\theta, Z} = \mathcal{L}(\theta, Z \mid Y)$ has
the property that $\pi_{\theta \mid Z}$ ($= \mathcal{L}(\theta \mid Y, Z)$) and $\pi_{Z \mid \theta}$ ($= \mathcal{L}(Z \mid Y, \theta)$) are easy to calculate. Typical
examples are missing and censored data problems. If we have a conjugate family of prior
distributions on $\theta$, then we may take $Z$ to be the missing or the censored observations,
so that $\pi_{\theta \mid Z}$ is easy to calculate. Successive substitution sampling then gives a random
observation with distribution (approximately) $\mathcal{L}(\theta, Z \mid Y)$, and retaining the first coordinate
gives an observation with distribution (approximately) equal to $\mathcal{L}(\theta \mid Y)$.

Another application arises when the parameter $\theta$ is high dimensional, and we are in a
nonconjugate situation. Let us write $\theta = (\theta_1, \ldots, \theta_k)$, so that what we wish to obtain is
$\pi_{\theta_1, \ldots, \theta_k}$. Direct calculation of the posterior will involve the evaluation of a $k$-dimensional
integral, which may be difficult to accomplish. On the other hand, application of the
successive substitution sampling algorithm involves the generation of one-dimensional random
variables from $\pi_{\theta_i \mid \{\theta_j,\, j \neq i\}}$, which is available in closed form, except for a normalizing
constant. There exist very efficient algorithms for doing this; see Zaman (1992).

Let us now return to the Markov chain simulation method. Let $P$ be a transition
probability function on the measurable space $(\mathcal{X}, \mathcal{B})$, i.e. $P$ is a function on $\mathcal{X} \times \mathcal{B}$ such
that for each $x \in \mathcal{X}$, $P(x, \cdot)$ is a probability measure on $(\mathcal{X}, \mathcal{B})$, and for each $C \in \mathcal{B}$, $P(\cdot, C)$
is a measurable function on $(\mathcal{X}, \mathcal{B})$. Let $X_0, X_1, \ldots$ be a Markov chain with transition
probability function $P$, i.e. $P(X_n \in C \mid X_{n-1} = x) = P(x, C)$, for $n = 1, 2, \ldots$. If $X_0 = x$,
we will say that the Markov chain starts at $x$, and for any event $C$, $P(C \mid X_0 = x)$ will be
denoted by $P_x(C)$. Similarly, for any bounded measurable function $f$ defined on the Markov
chain, $E(f \mid X_0 = x)$ will be denoted by $E_x(f)$. Let $P^n(x, C) = P(X_n \in C \mid X_0 = x)$.
Suppose that $\pi$ is a probability measure on $(\mathcal{X}, \mathcal{B})$ and is a stationary probability measure
for the Markov chain, i.e. it satisfies (1.1).

When $\sup_{C \in \mathcal{B}} |P^n(x, C) - \pi(C)| \to 0$ ($\sup_{C \in \mathcal{B}} |\frac{1}{n} \sum_{j=1}^{n} P^j(x, C) - \pi(C)| \to 0$) for a
set of starting points $x$ which has probability 1 with respect to the stationary measure
$\pi$, we will say that the Markov chain is ergodic (mean ergodic). The objective of this
paper is to give theorems, whose conditions are very simple to check, and which guarantee
ergodicity or mean ergodicity. (These are the minimum conditions required for success
of the Markov chain simulation method. Once such conditions are established, it is useful to
also make a statement on the rate of convergence, if this is possible.) Before stating these
theorems, we will need a few definitions concerning Markov chains. For any set $C \in \mathcal{B}$,
let $N_n(C) = \sum_{m=1}^{n} I(X_m \in C)$ and $N(C) = \sum_{m=1}^{\infty} I(X_m \in C)$ be the number of visits
to $C$ by time $n$ and the total number of visits to $C$, respectively. The expectations of
$N_n(C)$ and $N(C)$, when the chain starts at $x$, are given by $G_n(x, C) = \sum_{m=1}^{n} P^m(x, C)$ and
$G(x, C) = \sum_{m=1}^{\infty} P^m(x, C)$, respectively. Define $T(C) = \inf\{n : n > 0,\ X_n \in C\}$ to be the
first time the chain hits $C$, after time 0. Note that $P_x(T(C) < \infty) > 0$ is equivalent to
$G(x, C) > 0$.

The set $A \in \mathcal{B}$ is said to be accessible if

$$P_x(T(A) < \infty) > 0 \quad \text{for all } x \in \mathcal{X}.$$

Let $\rho$ be a probability measure on $(\mathcal{X}, \mathcal{B})$. The Markov chain is said to be $\rho$-irreducible if
every set $A$ with $\rho(A) > 0$ is accessible. The set $A$ is said to be recurrent if

$$P_x(T(A) < \infty) = 1 \quad \text{for all } x \in \mathcal{X}.$$

For the case where the $\sigma$-field $\mathcal{B}$ is separable, there is a very useful equivalent definition
of $\rho$-irreducibility of a Markov chain. In this case, we can deduce from Theorem 2.1 of
Orey (1971), on the existence of “C-sets”, that $\rho$-irreducibility of a Markov chain implies
that there exist a set $A \in \mathcal{B}$ with $\rho(A) > 0$, an integer $n_0$, and a number $\epsilon > 0$ satisfying

$$P_x(T(A) < \infty) > 0 \quad \text{for all } x \in \mathcal{X}, \tag{1.2}$$

and

$$x \in A,\ C \subset A \quad \text{imply} \quad P^{n_0}(x, C) \ge \epsilon \rho(C). \tag{1.3}$$

Let $\rho_A(C) = \rho(C \cap A)/\rho(A)$, which is well defined because $\rho(A) > 0$. The set function $\rho_A$ is a
probability measure satisfying $\rho_A(A) = 1$. Note that (1.2) simply states that $A$ is an
accessible set and this condition does not make reference to the probability measure $\rho$.
Condition (1.3) states that uniformly in $x \in A$, the $n_0$-step transition probabilities from
$x$ into subsets of $A$ are bounded below by $\epsilon$ times $\rho$. That (1.2) and (1.3) imply
$\rho_A$-irreducibility is, of course, immediate. This alternative definition of $\rho_A$-irreducibility, which
applies to nonseparable $\sigma$-fields as well, will usually be much easier to verify in Markov
chain simulation problems. By replacing $\rho$ by $\rho_A$, we can also assume with no loss of
generality that $\rho$ is a probability measure with $\rho(A) = 1$ when verifying Condition (1.3).

For any subset $M$ of the positive integers, g.c.d.$(M)$ will denote the greatest common
divisor of the integers in $M$.

The main results of this paper are the following two theorems, which are stated for
general  Markov  chains.  They  give  sufficient  conditions  for  the  Markov  chain  simulation 
method  to  be  successful. 

Theorem 1 Suppose that the Markov chain $\{X_n\}$ with transition function $P(x, C)$ has
an invariant probability measure $\pi$, i.e. (1.1) holds. Suppose that there is a set $A \in \mathcal{B}$, a
probability measure $\rho$ with $\rho(A) = 1$, a constant $\epsilon > 0$, and an integer $n_0 \ge 1$ such that

$$\pi\{x : P_x(T(A) < \infty) > 0\} = 1, \tag{1.4}$$

and

$$P^{n_0}(x, \cdot) \ge \epsilon \rho(\cdot) \quad \text{for each } x \in A. \tag{1.5}$$

Suppose further that

$$\text{g.c.d.}\{m \ge 1 : \text{there is an } \epsilon_m > 0 \text{ such that } P^m(x, \cdot) \ge \epsilon_m \rho(\cdot) \text{ for each } x \in A\} = 1. \tag{1.6}$$

Then there is a set $D_0$ such that

$$\pi(D_0) = 1 \quad \text{and} \quad \sup_{C \in \mathcal{B}} |P^n(x, C) - \pi(C)| \to 0 \quad \text{for each } x \in D_0. \tag{1.7}$$


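Theorem 2 Suppose that the Markov chain $\{X_n\}$ with transition function $P(x, C)$ satisfies
conditions (1.1), (1.4) and (1.5). Then

$$\sup_{C \in \mathcal{B}} \Big| \frac{1}{n_0} \sum_{r=0}^{n_0 - 1} P^{m n_0 + r}(x, C) - \pi(C) \Big| \to 0 \quad \text{as } m \to \infty \text{ for } [\pi]\text{-almost all } x, \tag{1.8}$$

and hence

$$\sup_{C \in \mathcal{B}} \Big| \frac{1}{n} \sum_{j=1}^{n} P^j(x, C) - \pi(C) \Big| \to 0 \quad \text{as } n \to \infty \text{ for } [\pi]\text{-almost all } x. \tag{1.9}$$

Let $f(x)$ be a measurable function on $(\mathcal{X}, \mathcal{B})$ such that $\int \pi(dy) |f(y)| < \infty$. Then

$$P_x\Big\{ \frac{1}{n} \sum_{j=1}^{n} f(X_j) \to \int \pi(dy) f(y) \Big\} = 1 \quad \text{for } [\pi]\text{-almost all } x \tag{1.10}$$

and

$$\frac{1}{n} \sum_{j=1}^{n} E_x(f(X_j)) \to \int \pi(dy) f(y) \quad \text{for } [\pi]\text{-almost all } x. \tag{1.11}$$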

Variants of these theorems form a main core of interest in the Markov chain literature.
However, most of this literature makes strong assumptions such as the existence of a recurrent
set $A$ and proves the existence of a stationary probability measure before establishing
(1.7) and (1.8). Theorems 1 and 2 exploit the existence of a stationary probability measure,
which is given to us “for free” in the Markov chain simulation method, and establish
ergodicity or mean ergodicity under minimal and easily verifiable assumptions. For example,
we have already noted that in the context of the Markov chain simulation method, we
really need to check only (1.4), (1.5), and (1.6). To show (1.4) in most cases one will
establish that $P_x(T(A) < \infty) > 0$ for all $x$. Condition (1.6) is usually called the aperiodicity
condition and is automatically satisfied if (1.5) holds with $n_0 = 1$. We also indicate the
critical points in the proof where one can use additional information to obtain results on
the rate of convergence.

There is a long history of ergodic theorems for Markov chains. For the case where $\mathcal{X}$ is
a finite space, it has long been known that if there is an integer $k$ and a point $y \in \mathcal{X}$ such
that $\min_{x \in \mathcal{X}} P^k(x, \{y\}) > \epsilon > 0$, then (1.7) holds at an exponential rate; see for instance
p. 173 of Doob (1953). Other sufficient conditions for ergodicity are known for the case
where $\mathcal{X}$ is a countable space. See e.g. Theorems 1.2 and 1.3 of Chapter 3 of Karlin and
Taylor (1975).

In interesting problems, including those that arise in Bayesian statistics, the state space
$\mathcal{X}$ generally is not countable. Early results on ergodicity of Markov chains on general state
spaces used a condition known as the Doeblin condition; see Hypothesis (D') on p. 197 of
Doob (1953), which can be stated in an equivalent way as follows. There is a probability
measure $\phi$ on $(\mathcal{X}, \mathcal{B})$, an integer $k$, and an $\epsilon > 0$ such that

$$P^k(x, C) \ge \epsilon \phi(C) \quad \text{for all } x \in \mathcal{X} \text{ and for all } C \in \mathcal{B}.$$

This is a very strong condition. It implies that there exists a stationary probability measure
to which the Markov chain converges at a geometric rate, from any starting point.

Theorem 3 Suppose that the Markov chain satisfies the Doeblin condition. Then there
exists a unique invariant probability measure $\pi$ such that for all $n$

$$\sup_{C} |P^n(x, C) - \pi(C)| \le (1 - \epsilon)^{(n/k) - 1} \quad \text{for all } x \in \mathcal{X}.$$
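
For a finite state space the Doeblin condition with $k = 1$ can be checked directly, and a geometric bound of the form $(1 - \epsilon)^{n-1}$ can be compared with the actual total variation distance. The sketch below does this for a small transition matrix `P` chosen only for illustration (it is not taken from the paper); the helper names are hypothetical, and the invariant vector is approximated by repeated multiplication rather than solved for exactly.

```python
def mat_mul(a, b):
    """Product of two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def doeblin_constant(p):
    """Largest eps with P(x, .) >= eps * phi(.) for every x (the case k = 1)."""
    n = len(p)
    return sum(min(p[x][y] for x in range(n)) for y in range(n))


def stationary(p, n_iter=2000):
    """Approximate the invariant vector by repeated multiplication."""
    n = len(p)
    v = [1.0 / n] * n
    for _ in range(n_iter):
        v = [sum(v[x] * p[x][y] for x in range(n)) for y in range(n)]
    return v


def worst_tv_distance(p, pi, n_steps):
    """max over x of sup_C |P^n(x, C) - pi(C)|, for n = 1, ..., n_steps."""
    out = []
    power = [row[:] for row in p]  # holds P^n, starting with n = 1
    for _ in range(n_steps):
        out.append(max(0.5 * sum(abs(row[y] - pi[y]) for y in range(len(pi)))
                       for row in power))
        power = mat_mul(power, p)
    return out


# Illustrative 3-state chain (an assumption, not from the paper).
P = [[0.5, 0.3, 0.2],
     [0.2, 0.6, 0.2],
     [0.3, 0.3, 0.4]]
eps = doeblin_constant(P)  # sum of column minima: 0.2 + 0.3 + 0.2
pi = stationary(P)
tv = worst_tv_distance(P, pi, 10)
bounds = [(1.0 - eps) ** (n - 1) for n in range(1, 11)]
```

Comparing `tv` with `bounds` entry by entry shows the distance sitting below the geometric envelope at every step. On general state spaces such a uniform minorization over all of $\mathcal{X}$ is rarely available, which is the motivation for the weaker conditions discussed next.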

A  proof  of  this  theorem  may  be  found  on  p.  197  of  Doob  (1953).  The  Doeblin  condition, 
though  easy  to  state,  is  very  strong  and  rarely  holds  in  the  problems  that  appear  in  the 
class of applications we are considering. We note that it is equivalent to the conditions
of Theorem 1, with the set $A$ of Theorem 1 replaced by $\mathcal{X}$. In its absence, one has
to impose the obvious conditions of irreducibility and aperiodicity and some other extra
conditions, oftentimes recurrence, to obtain ergodicity. Standard references in this area are
Orey  (1971),  Revuz  (1975)  and  Nummelin  (1984).  An  exposition  suitable  for  our  purposes 
can  be  found  in  Athreya  and  Ney  (1978).  Theorem  4.1  of  that  paper  may  be  stated  as 
follows. 

Theorem 4 Suppose that there is a set $A \in \mathcal{B}$, a probability measure $\rho$ concentrated on
$A$, and an $\epsilon$ with $0 < \epsilon < 1$ such that

$$P_x(T(A) < \infty) = 1 \quad \text{for all } x \in \mathcal{X},$$

and

$$P(x, C) \ge \epsilon \rho(C) \quad \text{for all } x \in A \text{ and all } C \in \mathcal{B}.$$

Suppose further that there is an invariant probability measure $\pi$. Then

$$\sup_{C \in \mathcal{B}} |P^n(x, C) - \pi(C)| \to 0 \quad \text{for all } x \in \mathcal{X}.$$

This theorem establishes ergodicity under the assumption of the existence of a stationary
probability measure but also makes the strong assumption of the existence of a recurrent
set $A$. It is always difficult to check that a set $A$ is recurrent. Our main results, Theorems 1
and 2, weaken this recurrence condition to just the accessibility of the set $A$ from $[\pi]$-almost
all  starting  points  x.  We  believe  that  this  makes  it  routine  to  check  the  conditions  of  our 
theorem  in  Markov  chain  simulation  problems. 

Tierney  (1991)  gives  sufficient  conditions  for  convergence  of  Markov  chains  to  their 
stationary  distribution.  The  main  part  of  his  Theorem  1  may  be  stated  as  follows. 

Theorem 5 Suppose that the chain has invariant probability measure $\pi$. Assume that the
chain is $\pi$-irreducible and aperiodic. Then (1.7) holds.

The main difference between Theorems 1 and 5 is that in Theorem 1 the probability measure
with respect to which irreducibility needs to be verified is not restricted to be the stationary
measure. This distinction is more than cosmetic. To check $\pi$-irreducibility, one has to
check that a certain condition holds for all sets which have positive probability under the
stationary distribution. For certain Markov chain simulation problems in which the state
space is very complicated, it is difficult or impossible to even identify these sets, since it
is difficult to get a handle on the unknown $\pi$. An example of such a situation arose in
the context of Bayesian nonparametrics in Doss (1991), where the state space was the set
of all distribution functions. In that paper, the author was not able to get enough of a
handle on $\pi$ to identify those sets to which it gives positive probability. On the other hand,
a convenient choice of $\rho$ made it possible to check $\rho$-irreducibility through the equivalent
Conditions (1.4) and (1.5).

Another  application  of  Theorem  1  in  the  area  of  Bayesian  nonparametrics  appears  in 
Escobar  and  West  (1991). 

We  point  out  that  Tierney  (1991)  does  not  give  a  detailed  definition  of  aperiodicity, 
but  refers  the  reader  to  Chapter  2.4  of  Nummelin  (1984)  where  an  implicit  definition  of 
the period of a Markov chain is given. In the present paper, aperiodicity, as constructively
defined in (1.6), is extremely easy to check: if the $n_0$ appearing in (1.5) is 1, then (1.6) is
automatic.

Results which give not only convergence of the Markov chain to its stationary distribution
but also convergence at a geometric rate are obviously extremely desirable. Such
results are given in Theorem 1 of Schervish and Carlin (1990) and in Proposition 1 of
Tierney (1991). It is, however, important to keep in mind that checking conditions that
ensure convergence at a geometric rate is usually an order of magnitude more difficult than
checking the conditions needed for simple convergence, for example those of Theorems 1 and 5 in
the present paper. This is because in cases where the dimension of the state space of the
Markov chain is very high, it is usually extremely difficult to check the integrability conditions
needed. This situation arises in Bayesian nonparametrics for example; see Doss (1991)
for an illustration.

In addition, the Markov chain may converge but not at a geometric rate. This can
happen even in very simple situations. An illustration is provided in the example below,
which is due to T. Sellke. Let $U$ be a random variable on $\mathbb{R}$ with distribution $\nu$ which we
take to be the standard Cauchy distribution. Let the conditional distribution of $V$ given
$U$ be the Beta distribution with parameters 2 and 2, shifted so that it is centered at $U$,
and let $X = (U, V)$. If we start successive substitution sampling at $X_0 = (0, 0)$, then it is
easy to see that $U_1$ must be in the interval $(-1, 1)$, and in fact, the value of $U$ can change
by at most one unit at each iteration. Thus, the distribution of $U_n$ is concentrated in the
interval $(-n, n)$. In particular,

$$\sup_{C} |P(U_n \in C \mid U_0 = 0) - \nu(C)| \ge \nu\{(-\infty, -n) \cup (n, \infty)\} \sim \frac{2}{\pi}\,\frac{1}{n},$$

where $\pi$ in the last expression is the numerical constant, so that the rate of convergence
cannot be geometric. The distribution $\nu$ could have been
taken to be any distribution whose tails are “thicker than those of the exponential
distribution”, and in fact, we can make the rate of convergence arbitrarily slow by taking the
tails of $\nu$ to be sufficiently thick.
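
The lower bound in this example is easy to evaluate numerically. The sketch below uses the closed form of the standard Cauchy distribution function to compute the two-sided tail mass and then looks for the first $n$ at which that mass exceeds a candidate geometric rate `r**n`; the function names and the search cap `n_max` are assumptions introduced for the illustration.

```python
import math


def cauchy_two_sided_tail(n):
    """nu{(-inf, -n) U (n, inf)} for the standard Cauchy distribution nu."""
    return 1.0 - (2.0 / math.pi) * math.atan(n)


def first_n_beating_geometric(r, n_max=10_000):
    """Smallest n <= n_max at which the tail mass exceeds r**n, if any."""
    for n in range(1, n_max + 1):
        if cauchy_two_sided_tail(n) > r ** n:
            return n
    return None
```

For any fixed `r` in (0, 1) the search eventually succeeds, since the tail decays only like a constant times `1/n` while `r**n` decays exponentially. From that point on, the distance from stationarity cannot be bounded by a constant multiple of `r**n` for that `r`.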

It is not difficult to see that if we select the starting point at random from a bounded
density concentrated in a neighborhood of the origin, then this example provides a simple
counterexample to Theorem 3 of Tanner and Wong (1987), which asserts convergence at a
geometric rate.

This paper is organized as follows. Section 2 gives the proofs of Theorems 1 and 2 and
also states and proves a theorem that gives, under additional conditions, convergence at a
geometric rate. Section 3 discusses briefly some issues to consider when deciding how to
use the output of the Markov chain to estimate $\pi$ and functionals of $\pi$.

2  Ergodic  Theorems  for  Markov  Chains  on  General 
State  Spaces 

The proofs of Theorems 1 and 2 rest on the familiar technique of regenerative events in
a Markov chain. See for instance Athreya and Ney (1978). In Section 2.1, we prove
Proposition 1 in which we assume that the set $A$ is a singleton $\alpha$, so that $\rho$ is the degenerate
probability measure on $\{\alpha\}$. We also assume that the singleton $\alpha$ is an aperiodic state, a
condition which is stated more fully as Condition (C) in Proposition 1 below. Under these
simplified assumptions we establish the ergodicity of the Markov chain.

In Section 2.2 we establish Theorem 1 as follows. In Proposition 2 we show that, when
$n_0 = 1$, under the conditions of Theorem 1, a general Markov chain can be reduced to one
satisfying the above simplified assumptions of Proposition 1. This is done by enlarging
the state space with an extra point $\Delta$ and extending the Markov chain to the enlarged
space. We then show that this singleton set $\{\Delta\}$ satisfies the simplified assumptions of
Proposition 1. From this it follows that the extended chain is ergodic. After this step we
deduce that the original chain is also ergodic. Finally, we show how the condition $n_0 = 1$
can be discarded under the aperiodicity condition (1.6).

In Section 2.3 we prove Theorem 2, which asserts convergence of averages of transition
functions and averages of functions of the Markov chain, without the aperiodicity assumption
(1.6). The key step in the proof is to recognize that the Markov chain observed at time
points which are multiples of $n_0$ is an embedded Markov chain satisfying the conditions of
Proposition 2 and with a stationary probability distribution $\pi_0$ which is the restriction of $\pi$
to the set $A_0$ defined by (2.30). In the Markov chain literature, mean ergodicity is usually
obtained as an elementary consequence of ergodicity in the aperiodic case and the existence
of a well-defined period and cyclically moving disjoint subclasses. Our proof circumvents,
in a way which we believe is new, the need for well-defined periodicity and cyclically moving
disjoint subclasses.

2.1  State  Spaces  with  a  Distinguished  Point 

Fix a point $\alpha$ in $\mathcal{X}$. For convenience, we will refer to this point as the distinguished point.
We will often write just $\alpha$ for the singleton set $\{\alpha\}$. The number of visits to $\alpha$, $N_n(\{\alpha\})$
and $N(\{\alpha\})$, will be denoted simply by $N_n$ and $N$, respectively. The first time the chain
visits $\alpha$ after time 0, namely $T(\{\alpha\})$, will be denoted simply by $T$. Let

$$C_0 = \{x : P_x(T < \infty) = 1\} = \{x : P_x(T = \infty) = 0\} \tag{2.1}$$

and

$$\mathcal{X}_0 = \{x : P_x(T < \infty) > 0\} \tag{2.2}$$

be the set of all states $x$ from which $\alpha$ can be reached with probability 1 and the set of all
states from which $\alpha$ is accessible, respectively.

Definition 1 The state $\alpha$ is said to be transient if $P_\alpha(T < \infty) < 1$ and recurrent if
$P_\alpha(T < \infty) = 1$. The state $\alpha$ is said to be positive recurrent if $E_\alpha(T) < \infty$.

Proposition 1 Suppose that the transition function $P(x, C)$ satisfies the following conditions:

(A) $\pi$ is a stationary probability measure for $P$.

(B) $\pi\{x : P_x(T < \infty) > 0\} = 1$.

Then

$$\pi(C_0) = 1 \quad \text{and} \quad \sup_{C \in \mathcal{B}} \Big| \frac{1}{n} \sum_{j=1}^{n} P^j(x, C) - \pi(C) \Big| \to 0 \quad \text{for each } x \in C_0. \tag{2.3}$$

Suppose in addition that

(C) g.c.d.$\{n : P^n(\alpha, \alpha) > 0\} = 1$.

Then

$$\sup_{C \in \mathcal{B}} |P^n(x, C) - \pi(C)| \to 0 \quad \text{for each } x \in C_0. \tag{2.4}$$
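
For a finite chain, Condition (C) can be examined by computing the greatest common divisor of the return times to the distinguished state directly from matrix powers. The sketch below is only an illustration: the two transition matrices are made up, the function names are hypothetical, and the set of return times is truncated at `max_n`, which is an assumption that suffices for small chains.

```python
from math import gcd


def mat_mul(a, b):
    """Product of two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def return_time_gcd(p, state, max_n=50):
    """g.c.d. of {n <= max_n : P^n(state, state) > 0} for a finite chain."""
    g = 0
    power = [row[:] for row in p]  # holds P^n, starting with n = 1
    for n in range(1, max_n + 1):
        if power[state][state] > 0:
            g = gcd(g, n)  # gcd(0, n) == n, so the first hit initialises g
        power = mat_mul(power, p)
    return g


flip = [[0.0, 1.0], [1.0, 0.0]]  # deterministic alternation: returns at even n only
lazy = [[0.5, 0.5], [1.0, 0.0]]  # state 0 can hold, so a return at n = 1 is possible
```

Here `return_time_gcd(flip, 0)` is 2, so (C) fails and only convergence of averages as in (2.3) can be expected, whereas `return_time_gcd(lazy, 0)` is 1 and (C) holds.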

The  proof  of  this  proposition  is  given  after  the  remark  following  the  proof  of  Lemma  3. 

Lemma 1 If Conditions (A) and (B) of Proposition 1 hold, then $\pi(\alpha) > 0$ and $\alpha$ is positive
recurrent.

Proof We first establish that $\pi(\alpha) > 0$. From Condition (A) it follows that $\pi(\alpha) =
\int \pi(dx) P^n(x, \alpha)$ for $n = 1, 2, \ldots$, and hence

$$n\pi(\alpha) = \int \pi(dx) G_n(x, \alpha) \tag{2.5}$$

for all $n$. The Monotone Convergence Theorem and Condition (B) imply that

$$\lim_n n\pi(\alpha) = \int \pi(dx) G(x, \alpha) > 0, \tag{2.6}$$

and hence $\pi(\alpha) > 0$.

Let the Markov chain start at some $x \in \mathcal{X}$. Let $T_1 = T$ and $T_k = \inf\{n : n >
T_{k-1},\ X_n = \alpha\}$ for $k = 2, 3, \ldots$, with the usual convention that the infimum of the empty
set is $\infty$. If $N < \infty$ then only finitely many $T_k$'s are finite. If $N = \infty$ then all the
$T_k$'s are finite. In the latter case, the Markov chain starts afresh from $\alpha$ at time $T_k$, and
hence $T_k - T_{k-1}$, $k = 2, 3, \ldots$ are independent and identically distributed with distribution
$H$ where $H(n) = P_\alpha(T \le n)$. These facts, the Strong Law of Large Numbers and the
inequality

$$\frac{N_n}{T_{N_n + 1}} \le \frac{N_n}{n} \le \frac{N_n}{T_{N_n}} \tag{2.7}$$

imply that

$$\lim_n \frac{N_n}{n} = \frac{1}{E_\alpha(T)}\, I(N = \infty) \tag{2.8}$$

with probability 1 under $P_x$ for each $x \in \mathcal{X}$.

From the Bounded Convergence Theorem, it follows that

$$\frac{1}{n} G_n(x, \alpha) = E_x\Big(\frac{N_n}{n}\Big) \to \frac{1}{E_\alpha(T)}\, P_x(N = \infty) \quad \text{for each } x \in \mathcal{X}.$$


Divide both sides of (2.5) by $n$, take limits and compare with the above. By using the
fact that $\pi$ is a probability measure and applying the Bounded Convergence Theorem, we
obtain

$$\pi(\alpha) = \frac{1}{E_\alpha(T)} \int \pi(dx) P_x(N = \infty). \tag{2.9}$$

Since $\pi(\alpha) > 0$, it follows that $\int \pi(dx) P_x(N = \infty) > 0$ and $E_\alpha(T) < \infty$, and hence $\alpha$ is
positive recurrent. □
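
When the chain returns to $\alpha$ infinitely often from every starting point, the integral in (2.9) equals 1 and the identity reduces to $\pi(\alpha) = 1/E_\alpha(T)$. This can be checked by simulation on a two-state chain, where both sides are known in closed form. The chain below is an illustrative assumption: state 0 plays the role of $\alpha$, the chain moves from 0 to 1 with probability `a` and from 1 to 0 with probability `b`, so that $\pi(0) = b/(a+b)$ and $E_0(T) = 1 + a/b$. The function name is hypothetical.

```python
import random


def mean_return_time(a, b, n_returns, seed=0):
    """Estimate E_alpha(T) for the two-state chain with alpha = state 0.

    The chain moves 0 -> 1 with probability a and 1 -> 0 with probability b.
    Each excursion starts at 0 and is run until the first return to 0.
    """
    rng = random.Random(seed)
    total = 0
    for _ in range(n_returns):
        state, steps = 0, 0
        while True:
            if state == 0:
                state = 1 if rng.random() < a else 0
            else:
                state = 0 if rng.random() < b else 1
            steps += 1
            if state == 0:  # first return to alpha after time 0
                break
        total += steps
    return total / n_returns
```

With `a = 0.3` and `b = 0.2` the exact values are $E_0(T) = 2.5$ and $\pi(0) = 0.4$, so the reciprocal of the simulated mean return time should land near 0.4.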


The arguments leading to the conclusion $\pi(\alpha) > 0$ in the above lemma were
based on (2.5) and (2.6) and did not use the full force of Condition (B). The following
corollary records that fact and will be used later in this paper.

Corollary 1 Let $\pi$ satisfy (A) of Proposition 1 and let $E \in \mathcal{B}$ be such that

$$\pi(\{x : G(x, E) > 0\}) > 0.$$

Then $\pi(E) > 0$.

The fact that $\alpha$ is positive recurrent gives us a way of obtaining an explicit form for a
finite stationary measure $\nu$ and showing that it must be a multiple of $\pi$.


Lemma 2 Let $\alpha$ be recurrent. Let

$$\nu(C) = E_\alpha\Big(\sum_{j=0}^{T-1} I(X_j \in C)\Big) = \sum_{n=0}^{\infty} P_\alpha(X_n \in C,\ T > n) \tag{2.10}$$

be the expected number of visits to $C$ between consecutive visits to $\alpha$, beginning from
$\alpha$. Then $\nu$ is a stationary measure for $P(\cdot, \cdot)$ with $\nu(\mathcal{X}_0^c) = 0$, and is unique up to a
multiplicative constant; more precisely,

$$\nu(\cdot) = \int P(x, \cdot)\,\nu(dx),$$

and if $\nu'$ is any other stationary measure with $\nu'(\mathcal{X}_0^c) = 0$, then

$$\nu'(C) = \nu'(\alpha)\nu(C) \quad \text{for all } C \in \mathcal{B}.$$

The measure $\nu$ also has the property

$$\nu(C_0^c) = 0.$$

Suppose that Conditions (A) and (B) of Proposition 1 hold, so that $\alpha$ is positive recurrent
and $\pi$ is a stationary probability measure for $P(\cdot, \cdot)$ with $\pi(\mathcal{X}_0^c) = 0$. Then

$$\nu(\mathcal{X}) = E_\alpha(T) < \infty$$

and

$$\pi(C) = \frac{\nu(C)}{\nu(\mathcal{X})}$$

is the unique stationary probability measure with $\pi(C_0) = 1$.
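
For a finite chain the series in (2.10) can be summed numerically, which gives a concrete check that the resulting measure is stationary and has mass 1 at $\alpha$. In the sketch below the transition matrix `P`, the choice of state 0 as $\alpha$, the function name, and the truncation of the series after `n_terms` terms are all assumptions made for illustration.

```python
def regenerative_measure(p, alpha=0, n_terms=2000):
    """nu(y) = sum_{n >= 0} P_alpha(X_n = y, T > n) for a finite chain.

    The series is truncated after n_terms terms (an assumption of this
    sketch; the neglected tail is negligible when returns to alpha are fast).
    """
    size = len(p)
    taboo = [0.0] * size
    taboo[alpha] = 1.0  # n = 0: the chain sits at alpha and T > 0
    nu = taboo[:]
    for _ in range(1, n_terms):
        step = [sum(taboo[x] * p[x][y] for x in range(size))
                for y in range(size)]
        step[alpha] = 0.0  # paths that are back at alpha have T <= n
        taboo = step
        nu = [nu[y] + taboo[y] for y in range(size)]
    return nu


# Illustrative 3-state chain (an assumption, not from the paper).
P = [[0.5, 0.3, 0.2],
     [0.2, 0.6, 0.2],
     [0.3, 0.3, 0.4]]
nu = regenerative_measure(P)
nu_P = [sum(nu[x] * P[x][y] for x in range(3)) for y in range(3)]
pi = [v / sum(nu) for v in nu]  # normalised version of nu
```

In this example `nu` and `nu_P` agree up to rounding, `nu[0]` equals 1, and `sum(nu)` plays the role of $E_\alpha(T)$, so `pi` is the normalised stationary vector.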


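Proof Since $P_\alpha(X_0 = \alpha) = 1$ we have $\nu(\alpha) = 1 = P_\alpha(T < \infty)$. To show that
$\nu(C_0^c) = 0$, notice that for all $n$

$$0 = P_\alpha(T = \infty) = E_\alpha(P_\alpha(T = \infty \mid X_1, X_2, \ldots, X_n)) = E_\alpha(P_{X_n}(T = \infty) I(T > n)).$$

From this it follows that

$$0 = P_\alpha\{P_{X_n}(T = \infty) I(T > n) > 0\} = P_\alpha\{X_n \in C_0^c,\ T > n\}$$

for each $n$. From the definition of $\nu$ in (2.10) it now follows that $\nu(C_0^c) = 0$. We now show
that $\nu$ is a stationary measure. Let $f(x)$ be a bounded measurable function on $(\mathcal{X}, \mathcal{B})$.
Then

$$\begin{aligned}
\int \nu(dx) f(x) &= \sum_{n=0}^{\infty} E_\alpha(f(X_n) I(T > n)) \\
&= f(\alpha) + \sum_{n=1}^{\infty} \big(E_\alpha(f(X_n) I(T > n-1)) - E_\alpha(f(X_n) I(T = n))\big) \\
&= f(\alpha) + \sum_{n=1}^{\infty} E_\alpha\big(E_\alpha(f(X_n) I(T > n-1) \mid X_0, X_1, \ldots, X_{n-1})\big) - \sum_{n=1}^{\infty} E_\alpha(f(X_n) I(T = n)) \\
&= f(\alpha) - f(\alpha) P_\alpha(T < \infty) + \sum_{n=1}^{\infty} E_\alpha\big(E_{X_{n-1}}(f(X_1)) I(T > n-1)\big) \\
&= \sum_{n=1}^{\infty} E_\alpha\Big(\int P(X_{n-1}, dy) f(y)\, I(T > n-1)\Big) \\
&= \sum_{n=0}^{\infty} E_\alpha\Big(\int P(X_n, dy) f(y)\, I(T > n)\Big) \\
&= \int \Big(\int \nu(dx) P(x, dy)\Big) f(y),
\end{aligned}$$

where the fourth equality in the above follows from the Markov property. This shows that
$\nu$ is a stationary measure.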

Let $\nu'$ be any other stationary measure for $P(\cdot,\cdot)$ satisfying $\nu'(X_0^c) = 0$. Fix $C \in \mathcal{B}$. Then for $C$ such that $\alpha \notin C$,

\[ \begin{aligned}
\nu'(C) &= \int \nu'(dx) P(x, C) \\
&= \nu'(\alpha) P_\alpha(X_1 \in C) + \int_{x \ne \alpha} \nu'(dx) P_x(X_1 \in C) \\
&= \nu'(\alpha) P_\alpha(X_1 \in C) + \int \nu'(dy) \int_{x \ne \alpha} P(y, dx) P_x(X_1 \in C) \\
&= \nu'(\alpha) P_\alpha(X_1 \in C) + \int \nu'(dy) P_y(X_2 \in C,\ T > 1) \\
&= \nu'(\alpha) \sum_{m=1}^{n} P_\alpha(X_m \in C,\ T > m-1) + \int \nu'(dy) P_y(X_{n+1} \in C,\ T > n) \\
&\ge \nu'(\alpha) \sum_{m=1}^{n} P_\alpha(X_m \in C,\ T > m-1) \\
&\ge \nu'(\alpha) \sum_{m=1}^{n} P_\alpha(X_m \in C,\ T > m)
\end{aligned} \]

for each $n$. In the last line above we used the fact that $\{X_m \in C,\ T > m-1\} = \{X_m \in C,\ T > m\}$, since $\alpha \notin C$. Thus $\nu'(C) \ge \nu'(\alpha)\nu(C)$ for all $C$ since $\nu(\alpha) = 1$. Let $\lambda(C) = \nu'(C) - \nu'(\alpha)\nu(C)$. Then $\lambda$ is a stationary nonnegative measure and $\lambda(\alpha) = 0$ since $\nu(\alpha) = 1$. Thus

\[ 0 = \lambda(\alpha) = \int G_n(x, \alpha)\,\lambda(dx) \to \int G(x, \alpha)\,\lambda(dx) \]

by the Monotone Convergence Theorem. Therefore $0 = \lambda(X_0)$ since the integrand above, $G(x, \alpha)$, is positive for all $x \in X_0$ in view of Condition (B). This proves that

\[ \nu'(C) = \nu'(\alpha)\,\nu(C), \tag{2.11} \]

which shows that $\nu$ is the unique stationary measure satisfying $\nu(X_0^c) = 0$, up to a multiplicative constant.

We now assume that $\alpha$ is positive recurrent. Since $\sum_{n=0}^{T-1} I(X_n \in X) = T$, we have $\nu(X) = E_\alpha(T) < \infty$. Let $\pi$ be a stationary probability measure satisfying $\pi(X_0^c) = 0$. From (2.11), we have the equality

\[ \pi(C) = \pi(\alpha)\,\nu(C). \]

From the earlier part of this proof it now follows that $\pi$ is the unique stationary probability measure. ◊
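
As a numerical illustration of Lemma 2, the following minimal sketch computes the occupation measure $\nu$ of (2.10) for a small finite chain and normalizes it. The 3-state transition matrix, the choice of state 0 as the distinguished point, and all function names are assumptions made for this example only; they do not come from the text.

```python
# Hypothetical 3-state chain; state 0 plays the role of the point alpha.
P = [[0.5, 0.3, 0.2],
     [0.1, 0.6, 0.3],
     [0.4, 0.4, 0.2]]
ALPHA = 0

def occupation_measure(P, alpha, tol=1e-14, max_iter=10_000):
    """nu(j) = sum_n P_alpha(X_n = j, T > n), summed until terms are negligible."""
    k = len(P)
    term = [1.0 if j == alpha else 0.0 for j in range(k)]  # n = 0 term
    nu = term[:]
    for _ in range(max_iter):
        # One more step that avoids alpha: mass entering alpha is dropped.
        term = [0.0 if j == alpha else sum(term[i] * P[i][j] for i in range(k))
                for j in range(k)]
        nu = [a + b for a, b in zip(nu, term)]
        if sum(term) < tol:
            break
    return nu

nu = occupation_measure(P, ALPHA)
expected_return_time = sum(nu)               # nu(X) = E_alpha(T)
pi = [v / expected_return_time for v in nu]  # normalized measure nu / nu(X)
nu_P = [sum(nu[i] * P[i][j] for i in range(3)) for j in range(3)]  # nu P
```

For this chain, `nu[ALPHA]` equals 1 and `nu_P` agrees with `nu` up to rounding, which is the stationarity asserted in the lemma.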

One can consider general measurable functions $f(x)$ with $\int |f(y)|\,\pi(dy) < \infty$ instead of $I(x = \alpha)$, as was done in Lemmas 1 and 2. By reworking inequalities (2.7) and (2.8), showing that the end effects can be ignored, and using the law of large numbers for averages of i.i.d. random variables, we can obtain the following corollary.

Corollary 2 Let Conditions (A) and (B) of Proposition 1 hold. Let $f(x)$ be a measurable function with $\int |f(y)|\,\pi(dy) < \infty$. Let

\[ A_f = \Bigl\{ x : P_x\Bigl\{ \frac{1}{n} \sum_{j=1}^{n} f(X_j) \to \int f(x)\,d\pi(x) \Bigr\} = 1 \Bigr\}. \]

Then

\[ \pi(A_f) = 1 \]

and

\[ \frac{1}{n} \sum_{j=1}^{n} E_x(f(X_j)) \to \int \pi(dx) f(x) \quad \text{for } [\pi]\text{-almost all } x. \]

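Proof Using the definitions of the hitting times $\{T_r\}$ of $\alpha$ defined in Lemma 1, define

\[ U = \sum_{j=1}^{\min(n, T_1)} f(X_j), \quad V_r = \sum_{j=T_r+1}^{T_{r+1}} f(X_j), \quad V_r' = \sum_{j=T_r+1}^{T_{r+1}} |f(X_j)| \quad \text{and} \quad W = \sum_{j=T_{N_n}+1}^{n} f(X_j). \]

From the simple bounds

\[ |U| \le \sum_{j=1}^{T_1} |f(X_j)| \quad \text{and} \quad |W| \le V'_{N_n} \le \max_{1 \le r \le n} V_r' \]

and the fact that for each $x \in C_0$, under $P_x$, $V_1, V_2, \ldots$ are i.i.d. random variables with mean $E_\alpha(T) \int f(x)\,\pi(dx)$, and $E_x(T_1) < \infty$, we get $P_x(U/n \to 0) = P_x(W/n \to 0) = 1$. Thus,

\[ \frac{1}{n} \sum_{j=1}^{n} f(X_j) = \frac{U}{n} + \frac{N_n}{n} \cdot \frac{1}{N_n} \sum_{r=1}^{N_n - 1} V_r + \frac{W}{n} \to \int f(x)\,\pi(dx) \]

as $n \to \infty$ for $[\pi]$-almost all $x$. ◊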

To get the convergence assertions (2.3) and (2.4) of Proposition 1 we need the following lemma from renewal theory.

Lemma 3 Let $\{p_n,\ n = 0, 1, \ldots\}$ be a probability distribution with $p_0 = 0$ and let $\mu = \sum_n n p_n < \infty$. Let $\{\eta_i,\ i = 1, 2, \ldots\}$ be a sequence of i.i.d. random variables with distribution $\{p_n\}$. Let $S_0 = 0$, $S_k = \sum_{j=1}^{k} \eta_j$ for $k \ge 1$. Define $\{p_n^{(k)},\ n = 1, 2, \ldots\}$, $k = 1, 2, \ldots$ recursively by $p_n^{(1)} = p_n$, $p_n^{(k)} = \sum_{0 \le j \le n} p_j^{(k-1)} p_{n-j} = P(S_k = n)$. For $n = 0, 1, \ldots$, define

\[ r_n = \sum_{k=0}^{\infty} p_n^{(k)}, \tag{2.12} \]

where $p_n^{(0)} = P(S_0 = n)$. Then

(a) $r_n$ is the unique solution of the so-called renewal equations

\[ r_0 = 1, \quad r_n = \sum_{j=1}^{n} r_{n-j}\, p_j, \quad n = 1, 2, \ldots. \]

Furthermore,

(b) $\frac{1}{n} \sum_{j=0}^{n} r_j \to \frac{1}{\mu}$ as $n \to \infty$.

If the additional condition g.c.d.$\{n : p_n > 0\} = 1$ holds, then

(c) $r_n \to \frac{1}{\mu}$ as $n \to \infty$.

Proof It is easy to establish (a) by direct verification. To prove Part (b), we note that $\sum_{j=0}^{n} r_j = \sum_{k=0}^{\infty} P(S_k \le n) = E(N(n))$ where $N(n) = \sup\{k : S_k \le n\}$. By the Strong Law of Large Numbers and the inequalities

\[ S_{N(n)} \le n < S_{N(n)+1}, \]

it follows that

\[ \frac{N(n)}{n} \to \frac{1}{\mu} \quad \text{w.p. 1.} \]

Part (c) is the well-known discrete renewal theorem, for which there are many proofs in standard texts, some of which are purely analytic (see, e.g., Chapter XIII.10 in Feller (1950)) and others probabilistic (see, e.g., Chapter 2 of Hoel, Port, and Stone (1972)). ◊
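
The renewal equations in Part (a) of Lemma 3 can be iterated directly. The sketch below does so for two small distributions chosen purely for illustration (they are assumptions, not examples from the text): an aperiodic one, where $r_n$ itself approaches $1/\mu$, and a periodic one, where only the averages in Part (b) can converge.

```python
def renewal_sequence(p, n_max):
    """p[j] = P(eta = j) for j = 0..len(p)-1, with p[0] = 0.
    Returns [r_0, ..., r_{n_max}] from r_0 = 1, r_n = sum_j r_{n-j} p_j."""
    r = [1.0]
    for n in range(1, n_max + 1):
        r.append(sum(r[n - j] * p[j] for j in range(1, min(n, len(p) - 1) + 1)))
    return r

# Aperiodic case: eta is 1 or 2 with equal probability, so mu = 1.5.
p = [0.0, 0.5, 0.5]
mu = sum(j * pj for j, pj in enumerate(p))
r = renewal_sequence(p, 60)
average = sum(r) / len(r)            # average of r_0, ..., r_60

# Periodic case: eta = 2 always, so g.c.d.{n : p_n > 0} = 2 and r_n oscillates.
r_periodic = renewal_sequence([0.0, 0.0, 1.0], 10)
```

In the aperiodic case `r[60]` agrees with `1 / mu` to many digits, while `r_periodic` alternates between 1 and 0, so the g.c.d. condition in Part (c) cannot be dropped.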


Remark 1 The tail behavior of the probability distribution $\{p_n\}$ affects the rate of convergence of $|r_n - \frac{1}{\mu}|$. Here is an example of a result on rates of convergence. The following are equivalent:

\[ \sum_n \exp(n t_0)\, p_n < \infty \quad \text{for some } t_0 > 0. \tag{2.13} \]

\[ \sum_n \exp(n t_0)\, |r_n - r_{n+1}| < \infty \quad \text{for some } t_0 > 0. \tag{2.14} \]

\[ \Bigl| r_n - \frac{1}{\mu} \Bigr| = O(\rho^n). \tag{2.15} \]

When these conditions hold, it can be asserted that $\exp(-t_0) < \rho < 1$. Similarly, if $\sum_n n^p p_n < \infty$ for some $p > 0$ then it is known that there is a $\theta$ with $0 < \theta < p$ such that

\[ \Bigl| r_n - \frac{1}{\mu} \Bigr| = O(n^{-\theta}). \]

See, e.g., Asmussen (1987) or Stone (1965).

Proof of Proposition 1 Let $D$ be the collection of all measurable functions $f$ on $(X, \mathcal{B})$ with $\sup_y |f(y)| \le 1$. Let $f \in D$. Then for any $x \in X$,

\[ E_x(f(X_n)) = E_x\bigl(f(X_n) I(T > n)\bigr) + \sum_{k=0}^{n} P_x(T = k)\, E_\alpha(f(X_{n-k})), \quad n = 0, 1, \ldots. \tag{2.16} \]

Let $v_n = E_\alpha(f(X_n))$, $a_n = E_\alpha(f(X_n) I(T > n))$ and $p_n = P_\alpha(T = n)$, $n = 0, 1, \ldots$. Note that $v_n$ and $a_n$ also depend on the function $f$ while $p_n$ does not. Putting $x = \alpha$ in (2.16) we get the important identity

\[ v_n = a_n + \sum_{k=0}^{n} p_k v_{n-k}. \tag{2.17} \]

It is not difficult to check that $v_n = \sum_{k=0}^{n} a_k r_{n-k}$ is the unique solution to (2.17) where $r_n$ is as defined in (2.12). Thus

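\[ \frac{1}{n} \sum_{j=0}^{n} v_j = \frac{1}{n} \sum_{j=0}^{n} \sum_{k=0}^{j} a_k r_{j-k} = \frac{1}{n} \sum_{k=0}^{n} a_k R_{n-k} = \sum_{k=0}^{\infty} a_k \frac{R_{n-k}}{n} I(k \le n) \]

where $R_n = \sum_{j=0}^{n} r_j$. Also,

\[ \int f\,d\pi = \frac{E_\alpha\bigl(\sum_{j=0}^{T-1} f(X_j)\bigr)}{E_\alpha(T)} = \frac{\sum_{k=0}^{\infty} a_k}{E_\alpha(T)}, \qquad \mu = E_\alpha(T). \]

Thus for $f \in D$,

\[ \begin{aligned}
\Bigl| \frac{1}{n} \sum_{j=0}^{n} v_j - \int f\,d\pi \Bigr|
&= \Bigl| \sum_{k=0}^{\infty} a_k \Bigl( \frac{R_{n-k}}{n} I(k \le n) - \frac{1}{\mu} \Bigr) \Bigr| \\
&\le 2 \sum_{j=m}^{\infty} |a_j| + \Bigl( \sum_{j=0}^{\infty} |a_j| \Bigr) \sup_{n-m \le k \le n} \Bigl| \frac{R_k}{n} - \frac{1}{\mu} \Bigr| \\
&\le 2 \sum_{j=m}^{\infty} P_\alpha(T > j) + E_\alpha(T) \sup_{n-m \le k \le n} \Bigl| \frac{R_k}{n} - \frac{1}{\mu} \Bigr|
\end{aligned} \]

for any positive integer $m$. Note that for fixed $m$, $\sup_{n-m \le k \le n} |\frac{R_k}{n} - \frac{1}{\mu}| \to 0$ as $n \to \infty$ from Part (b) of Lemma 3, and $\sum_{j=m}^{\infty} P_\alpha(T > j) \to 0$ as $m \to \infty$, since $\alpha$ is positive recurrent.

By first fixing $m$ and letting $n \to \infty$, and then letting $m \to \infty$, we get

\[ \Bigl| \frac{1}{n} \sum_{j=0}^{n} v_j - \int f\,d\pi \Bigr| \to 0 \quad \text{uniformly in } f \text{ as } n \to \infty. \tag{2.18} \]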


Let $x \in C_0$. Let $w_n = E_x(f(X_n))$, $b_n = E_x(f(X_n) I(T > n))$ and $q_n = P_x(T = n)$. Note that for a fixed $x$, $b_n \to 0$ as $n \to \infty$, uniformly in $f$, and that $q_n$ is a probability sequence which does not depend on $f$. Using equation (2.16) once again, we see that $w_n$ satisfies the equation

\[ w_n = b_n + \sum_{k=0}^{n} q_k v_{n-k}, \quad n = 0, 1, \ldots. \tag{2.19} \]

Using (2.18), we conclude that

\[ \frac{1}{n} \sum_{j=0}^{n} w_j = \frac{1}{n} \sum_{j=0}^{n} b_j + \sum_{k=0}^{n} q_k \frac{1}{n} \sum_{j=0}^{n-k} v_j \to \int f\,d\pi \]

uniformly in $f$ as $n \to \infty$. This establishes (2.3) of Proposition 1.

We now use Condition (C). Under this assumption, g.c.d.$\{n : P^n(\alpha, \alpha) > 0\} = 1$, and thus g.c.d.$\{n : p_n > 0\} = 1$; see for instance the lemma on p. 29 of Chung (1967). Thus, from Part (c) of Lemma 3 we have $r_n \to \frac{1}{\mu}$. Repeating the arguments leading to (2.18) and (2.19) with this stronger result on $r_n$, we see that $v_n \to \int f\,d\pi$ and $w_n \to \int f\,d\pi$ uniformly in $f \in D$. This proves conclusion (2.4) and completes the proof of Proposition 1. ◊
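
The representation $v_n = \sum_{k=0}^{n} a_k r_{n-k}$ used in the proof can be checked numerically on a finite chain. In the sketch below the chain, the test function, and the helper names are hypothetical, and $r_n$ is computed as $P^n(\alpha, \alpha)$, which is assumed here to coincide with the renewal sequence (2.12) built from $p_n = P_\alpha(T = n)$.

```python
P = [[0.5, 0.3, 0.2],
     [0.1, 0.6, 0.3],
     [0.4, 0.4, 0.2]]
ALPHA = 0
f = [1.0, -0.5, 0.25]      # hypothetical test function with sup |f| <= 1

def step(row, P, taboo=None):
    """One transition of a row vector; mass entering the taboo state is dropped."""
    k = len(P)
    return [0.0 if j == taboo else sum(row[i] * P[i][j] for i in range(k))
            for j in range(k)]

N = 15
free = [1.0 if j == ALPHA else 0.0 for j in range(3)]  # law of X_n from alpha
killed = free[:]                                       # law of X_n on {T > n}
v, a, r = [], [], []
for n in range(N + 1):
    v.append(sum(x * y for x, y in zip(free, f)))      # v_n = E_alpha f(X_n)
    a.append(sum(x * y for x, y in zip(killed, f)))    # a_n = E_alpha f(X_n) I(T > n)
    r.append(free[ALPHA])                              # r_n = P^n(alpha, alpha)
    free, killed = step(free, P), step(killed, P, taboo=ALPHA)

# Convolution sum_{k <= n} a_k r_{n-k}, to be compared with v_n.
convolution = [sum(a[k] * r[n - k] for k in range(n + 1)) for n in range(N + 1)]
```

The two lists `v` and `convolution` agree up to rounding, which reflects the decomposition of a path according to its last visit to $\alpha$ before time $n$.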
2.2 Proof of Theorem 1 for General Markov Chains

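We will now establish Theorem 1 under the condition that the $n_0$ appearing in (1.5) is 1. This is stated as Proposition 2 below, and though it is technically weaker, its proof contains the heart of the arguments needed to establish Theorem 1.

Proposition 2 Suppose that $A \in \mathcal{B}$ and let $\rho$ be a probability measure on $(X, \mathcal{B})$ with $\rho(A) = 1$. Suppose that the transition function $P(x, C)$ of the Markov chain $\{X_n\}$ satisfies (1.1), (1.4), and (1.5) where the $n_0$ appearing in (1.5) is equal to 1. Then there is a set $D_0$ such that

\[ \pi(D_0) = 1 \quad \text{and} \quad \sup_{C \in \mathcal{B}} |P^n(x, C) - \pi(C)| \to 0 \quad \text{for each } x \in D_0. \tag{2.20} \]

Proof The proof consists of adding a point $\Delta$ to $X$, defining a transition function on the enlarged space and appealing to Proposition 1.

Consider the space $(\bar{X}, \bar{\mathcal{B}})$, where $\bar{X} = X \cup \{\Delta\}$ and $\bar{\mathcal{B}}$ is the smallest $\sigma$-field containing $\mathcal{B}$ and $\{\Delta\}$. Let $\varepsilon^* = \varepsilon/2$, and define the transition probability function $\bar{P}(x, C)$ on $(\bar{X}, \bar{\mathcal{B}})$ by

\[ \bar{P}(x, C) = \begin{cases}
P(x, C) & \text{if } x \in X \setminus A,\ C \in \mathcal{B} \\
P(x, C) - \varepsilon^* \rho(C) & \text{if } x \in A,\ C \in \mathcal{B} \\
\varepsilon^* & \text{if } x \in A,\ C = \{\Delta\} \\
\int_A \rho(dz) \bar{P}(z, C) & \text{if } x = \Delta,\ C \in \bar{\mathcal{B}}
\end{cases} \tag{2.21} \]

Also, define the probability measure $\bar{\pi}$ on $(\bar{X}, \bar{\mathcal{B}})$ by

\[ \bar{\pi}(C) = \begin{cases}
\pi(C) - \varepsilon^* \rho(C) \pi(A) & \text{if } C \in \mathcal{B} \\
\varepsilon^* \pi(A) & \text{if } C = \{\Delta\}
\end{cases} \tag{2.22} \]

We will now show that the transition probability function $\bar{P}(x, C)$ together with $\bar{\pi}$ and the distinguished point $\Delta$ satisfy Conditions (A), (B), and (C) of Proposition 1.

If $x \in A$ then $\bar{P}(x, \Delta) = \varepsilon^* > 0$, so that $\bar{G}(x, \Delta) > 0$. If $x \in X \setminus A$, we have $\bar{G}(x, \Delta) \ge \int_A \bar{G}(x, dy) \bar{P}(y, \Delta) \ge \varepsilon^* \bar{G}(x, A) > 0$ in view of (1.4). Finally $\bar{P}(\Delta, \Delta) = \varepsilon^* > 0$. This verifies both Conditions (A) and (C) of Proposition 1.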

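Next, for $C \in \mathcal{B}$, we have

\[ \begin{aligned}
\int_{\bar{X}} \bar{\pi}(dx) \bar{P}(x, C) &= \int_X \bigl( \pi(dx) - \varepsilon^* \rho(dx) \pi(A) \bigr) \bar{P}(x, C) + \varepsilon^* \pi(A) \int_X \rho(dx) \bar{P}(x, C) \\
&= \int_X \pi(dx) \bar{P}(x, C) \\
&= \int_X \pi(dx) \bigl( P(x, C) - \varepsilon^* \rho(C) I(x \in A) \bigr) \\
&= \pi(C) - \varepsilon^* \rho(C) \pi(A) \\
&= \bar{\pi}(C).
\end{aligned} \]

When $C = \{\Delta\}$, we have

\[ \begin{aligned}
\int_{\bar{X}} \bar{\pi}(dx) \bar{P}(x, \Delta) &= \int_X \bigl( \pi(dx) - \varepsilon^* \rho(dx) \pi(A) \bigr) \bigl( \varepsilon^* I(x \in A) \bigr) + \varepsilon^* \pi(A) \int_X \rho(dx)\, \varepsilon^* I(x \in A) \\
&= \varepsilon^* \pi(A) \\
&= \bar{\pi}(\Delta).
\end{aligned} \]

This verifies Condition (B) of Proposition 1.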
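
The construction (2.21) and (2.22) is easy to carry out when $X$ is finite. The following minimal sketch builds the enlarged transition matrix and measure for a hypothetical 3-state chain; the matrix, the set $A$, the measure $\rho$ and the value of $\varepsilon$ are all assumptions chosen so that $P(x, \cdot) \ge \varepsilon \rho(\cdot)$ holds on $A$, and the last index stands for the added point $\Delta$.

```python
P = [[0.5, 0.3, 0.2],
     [0.2, 0.5, 0.3],
     [0.3, 0.2, 0.5]]        # doubly stochastic, so pi is uniform
pi = [1 / 3, 1 / 3, 1 / 3]
A = {0, 1}
rho = [0.5, 0.5, 0.0]        # rho(A) = 1
eps = 0.4                    # P[x][j] >= eps * rho[j] for every x in A
eps_star = eps / 2
k = len(P)                   # index k plays the role of Delta

def split_chain(P, A, rho, eps_star):
    """Transition matrix on X plus Delta, following the four cases of (2.21)."""
    k = len(P)
    Pbar = []
    for x in range(k):
        if x in A:
            Pbar.append([P[x][j] - eps_star * rho[j] for j in range(k)] + [eps_star])
        else:
            Pbar.append(P[x][:] + [0.0])
    # Row for Delta: draw a point from rho, then move according to Pbar.
    Pbar.append([sum(rho[z] * Pbar[z][j] for z in range(k)) for j in range(k + 1)])
    return Pbar

Pbar = split_chain(P, A, rho, eps_star)
pi_A = sum(pi[x] for x in A)
pibar = [pi[j] - eps_star * rho[j] * pi_A for j in range(k)] + [eps_star * pi_A]
pibar_next = [sum(pibar[x] * Pbar[x][j] for x in range(k + 1)) for j in range(k + 1)]
```

Each row of `Pbar` sums to 1 and `pibar_next` agrees with `pibar`, which is Condition (B) for this example.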

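Thus Proposition 1 implies that there exists a set $\bar{D}_0 \in \bar{\mathcal{B}}$ such that

\[ \bar{\pi}(\bar{D}_0) = 1 \quad \text{and} \quad \sup_{C \in \bar{\mathcal{B}}} |\bar{P}^n(x, C) - \bar{\pi}(C)| \to 0 \quad \text{for each } x \in \bar{D}_0. \tag{2.23} \]

To translate (2.23) into a result for $P^n(x, C)$ we define a function $v(x, C)$ on $\bar{X} \times \mathcal{B}$ by

\[ v(x, C) = \begin{cases} I(x \in C) & \text{if } x \in X \\ \rho(C) & \text{if } x = \Delta \end{cases} \]

We may view $v(x, C)$ as a transition function from $\bar{X}$ into $X$. The following lemma shows how one can go from $\bar{P}^n(x, C)$ to $P^n(x, C)$ and back. The proof of Proposition 2 is continued after Lemma 5.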

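Lemma 4 The transition functions $P(x, C)$, $\bar{P}(x, C)$ and $v(x, C)$ and the probability measures $\pi$ and $\bar{\pi}$ are related as follows:

\[ P(x, C) = \int_{\bar{X}} \bar{P}(x, dy) v(y, C) \quad \text{for } x \in X,\ C \in \mathcal{B}, \tag{2.24} \]

\[ \bar{P}(x, C) = \int_X v(x, dy) \bar{P}(y, C) \quad \text{for } x \in \bar{X},\ C \in \bar{\mathcal{B}}, \tag{2.25} \]

\[ P^n(x, C) = \int_{\bar{X}} \bar{P}^n(x, dy) v(y, C) \quad \text{for } x \in X,\ C \in \mathcal{B}, \tag{2.26} \]

and

\[ \pi(C) = \int_{\bar{X}} \bar{\pi}(dx) v(x, C) \quad \text{for } C \in \mathcal{B}. \tag{2.27} \]

Proof These are proved by direct verification. For $x \in X$, $C \in \mathcal{B}$, we have

\[ \begin{aligned}
\int_{\bar{X}} \bar{P}(x, dy) v(y, C) &= \int_X \bar{P}(x, dy) I(y \in C) + \varepsilon^* I(x \in A) \rho(C) \\
&= P(x, C) - \varepsilon^* I(x \in A) \rho(C) + \varepsilon^* I(x \in A) \rho(C) \\
&= P(x, C).
\end{aligned} \]

Similarly, for $x \in \bar{X}$, $C \in \bar{\mathcal{B}}$, we get

\[ \int_X v(x, dy) \bar{P}(y, C) = \begin{cases} \bar{P}(x, C) & \text{if } x \in X \\ \int \rho(dy) \bar{P}(y, C) = \bar{P}(\Delta, C) & \text{if } x = \Delta \end{cases} \]

We prove (2.26) by induction on $n$. For $n = 1$, this is just (2.24). Assume that (2.26) has been proved for $n - 1$.

For $x \in X$, $C \in \mathcal{B}$, we have

\[ \begin{aligned}
\int_{y \in \bar{X}} \bar{P}^n(x, dy) v(y, C) &= \int_{z, y \in \bar{X}} \bar{P}^{n-1}(x, dz) \bar{P}(z, dy) v(y, C) \\
&= \int_{z \in \bar{X},\, w \in X,\, y \in \bar{X}} \bar{P}^{n-1}(x, dz) v(z, dw) \bar{P}(w, dy) v(y, C) \\
&= \int_{z \in \bar{X},\, w \in X} \bar{P}^{n-1}(x, dz) v(z, dw) P(w, C) \\
&= \int_{w \in X} P^{n-1}(x, dw) P(w, C) \\
&= P^n(x, C),
\end{aligned} \]

where the second equality follows from (2.25), the third follows from (2.24), and the fourth from the induction step.

Finally, for $C \in \mathcal{B}$, we notice that

\[ \int_{\bar{X}} \bar{\pi}(dx) v(x, C) = \int_X \bigl( \pi(dx) - \varepsilon^* \pi(A) \rho(dx) \bigr) v(x, C) + \varepsilon^* \pi(A) \rho(C) = \pi(C). \]

This completes the proof of the lemma. ◊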
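
Relation (2.26) can also be checked numerically in the finite case. The sketch below rebuilds the enlarged chain for a hypothetical 3-state example (all numbers are assumptions for illustration) and compares $P^n$ with $\bar{P}^n v$ for the first few $n$.

```python
P = [[0.5, 0.3, 0.2],
     [0.2, 0.5, 0.3],
     [0.3, 0.2, 0.5]]
A = {0, 1}
rho = [0.5, 0.5, 0.0]
eps_star = 0.2
k = len(P)                   # index k plays the role of Delta

# Enlarged transition matrix as in (2.21).
Pbar = [([P[x][j] - eps_star * rho[j] for j in range(k)] + [eps_star]) if x in A
        else (P[x] + [0.0]) for x in range(k)]
Pbar.append([sum(rho[z] * Pbar[z][j] for z in range(k)) for j in range(k + 1)])

# v(y, .) is a point mass at y for y in X and equals rho for y = Delta.
v = [[1.0 if j == y else 0.0 for j in range(k)] for y in range(k)] + [rho[:]]

def matmul(M1, M2):
    """Plain matrix product of two nested lists."""
    return [[sum(M1[i][l] * M2[l][j] for l in range(len(M2)))
             for j in range(len(M2[0]))] for i in range(len(M1))]

Pn, Pbar_n, max_gap = P, Pbar, 0.0
for n in range(1, 6):
    lifted = matmul(Pbar_n, v)       # rows 0..k-1 should reproduce P^n
    max_gap = max(max_gap, max(abs(lifted[x][j] - Pn[x][j])
                               for x in range(k) for j in range(k)))
    Pn, Pbar_n = matmul(Pn, P), matmul(Pbar_n, Pbar)
```

Here `max_gap` stays at rounding level, so the original chain is recovered from the enlarged one by redistributing the mass at $\Delta$ according to $\rho$.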

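The next lemma shows that $\bar{\pi}$ dominates $\rho$.

Lemma 5 Let $C \in \mathcal{B}$. Then

\[ \bar{\pi}(C) = 0 \quad \text{implies that} \quad \rho(C) = 0. \tag{2.28} \]

Proof From the careful choice of $\varepsilon^* = \varepsilon/2$ used to define $\bar{P}(x, C)$ in definition (2.21), we have

\[ \bar{P}(x, C) = P(x, C) - \varepsilon^* \rho(C) \ge \varepsilon^* \rho(C) \quad \text{whenever } x \in A \text{ and } C \in \mathcal{B}. \tag{2.29} \]

Applying Lemma 2 to the Markov chain $\{\bar{X}, \bar{\mathcal{B}}, \bar{P}(\cdot, \cdot)\}$ which has a stationary distribution $\bar{\pi}(\cdot)$ we get, for any $C \in \mathcal{B}$,

\[ \begin{aligned}
\bar{\pi}(C) &= \frac{1}{E_\Delta(T_\Delta)} E_\Delta\Bigl( \sum_{n=0}^{\infty} I(X_n \in C) I(T_\Delta > n) \Bigr) \\
&\ge \frac{1}{E_\Delta(T_\Delta)} E_\Delta\Bigl( \sum_{n=1}^{\infty} I(X_n \in C) I(T_\Delta > n) I(X_{n-1} \in A) \Bigr) \\
&= \frac{1}{E_\Delta(T_\Delta)} E_\Delta\Bigl( \sum_{n=1}^{\infty} I(T_\Delta > n-1) I(X_{n-1} \in A) \bar{P}(X_n \in C \mid X_{n-1}) \Bigr) \\
&\ge \frac{1}{E_\Delta(T_\Delta)} E_\Delta\Bigl( \sum_{n=1}^{\infty} \varepsilon^* \rho(C) I(T_\Delta > n-1) I(X_{n-1} \in A) \Bigr) \\
&= \frac{1}{E_\Delta(T_\Delta)} \varepsilon^* \rho(C) E_\Delta\Bigl( \sum_{n=0}^{\infty} I(T_\Delta > n) I(X_n \in A) \Bigr) \\
&= \varepsilon^* \rho(C) \bar{\pi}(A) \\
&= \varepsilon^* \rho(C) \pi(A) (1 - \varepsilon^* \rho(A)).
\end{aligned} \]

The equality in the third line follows from the fact that $\{X_n \in C,\ T_\Delta > n-1\} = \{X_n \in C,\ T_\Delta > n\}$, since $\Delta \notin C$, and the inequality in the fourth line follows from (2.29). Now since $\Delta$ is recurrent for the chain $\{\bar{X}, \bar{\mathcal{B}}, \bar{P}(\cdot, \cdot)\}$, we have $\bar{\pi}(\Delta) = \varepsilon^* \pi(A) > 0$, and this proves (2.28). ◊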

Completion of the proof of Proposition 2 Let $D_0 = \bar{D}_0 - \{\Delta\}$. From (2.23), (2.26), and (2.27), we have $\bar{\pi}(\bar{D}_0) = 1$, and

\[ \sup_{C \in \mathcal{B}} |P^n(x, C) - \pi(C)| = \sup_{C \in \mathcal{B}} \Bigl| \int_{\bar{X}} \bar{P}^n(x, dy) v(y, C) - \int_{\bar{X}} \bar{\pi}(dy) v(y, C) \Bigr| \to 0 \quad \text{for each } x \in D_0. \]

This means that

\[ \bar{\pi}(X - D_0) = \bar{\pi}(\bar{X} - \bar{D}_0) = 0. \]

From Lemma 5, it follows that

\[ \rho(X - D_0) = 0. \]

Now, from the definition of $\bar{\pi}(\cdot)$ in (2.22),

\[ \pi(X - D_0) = \bar{\pi}(X - D_0) + \varepsilon^* \rho(X - D_0) \pi(A) = 0. \]

This completes the proof that $\pi(D_0) = 1$ and

\[ \sup_{C \in \mathcal{B}} |P^n(x, C) - \pi(C)| \to 0 \quad \text{for all } x \in D_0. \]

◊

We now drop the condition $n_0 = 1$ and prove Theorem 1.

Proof of Theorem 1 Let $\mathcal{M} = \{m : \text{there is an } \varepsilon_m > 0 \text{ such that } \inf_{x \in A} P^m(x, \cdot) \ge \varepsilon_m \rho(\cdot)\}$. Then g.c.d.$(\mathcal{M}) = 1$. Fix an $m \in \mathcal{M}$. From a standard result on g.c.d.'s of sets of positive integers (see, e.g., Problem 2 on p. 77 of Karlin and Taylor (1975)), there is an integer $L$ such that $\mathcal{M}$ will contain all integers larger than $L$. This together with Condition (1.4) shows that $\sum_{k \ge 1} P^{km}(x, A) > 0$ for each $x \in X$. This means that the Markov chain viewed only at times which are multiples of $m$ satisfies (1.4) and (1.5) with $n_0 = 1$. Thus from Proposition 2 there is a set $D_0$ such that $\pi(D_0) = 1$, and for any $m \in \mathcal{M}$,

\[ \sup_{C \in \mathcal{B}} |P^{km}(x, C) - \pi(C)| \to 0 \quad \text{for } x \in D_0 \text{ as } k \to \infty. \]

Next, there is a finite subcollection $m_1, m_2, \ldots, m_r \in \mathcal{M}$ and integers $a_1, a_2, \ldots, a_r$ such that $\sum_{1 \le i \le r} a_i m_i = 1$. This is generally established during the proof of the standard fact on the g.c.d. of sets of integers quoted above (see, e.g., Problem 2 on p. 77 of Karlin and Taylor (1975)). Permute the indices if necessary and assume that $a_1 > 0, \ldots, a_s > 0$, $-a_{s+1} = b_{s+1} > 0, \ldots, -a_r = b_r > 0$, so that $N - M = 1$ where $N = \sum_{1 \le i \le s} a_i m_i$ and $M = \sum_{s < i \le r} b_i m_i > 0$. Any positive integer $K$ can be written as $K = k(M^2 + N) + r$ where $0 \le r < M^2 + N$ and $k \ge 0$. Writing $r = r(N - M)$ we have

\[ K = N(k + r) + M(kM - r) = (k + r) \sum_{1 \le i \le s} a_i m_i + (kM - r) \sum_{s < i \le r} b_i m_i. \]

Note that $kM - r > 0$ when $k > 4M$. When $K \to \infty$, the integer $k$ defined above tends to $\infty$, as do the multipliers $k + r$ and $kM - r$. Since for $[\pi]$-almost every $x$ we have $\sup_{C \in \mathcal{B}} |P^{k m_i}(x, C) - \pi(C)| \to 0$ as $k \to \infty$ for $i = 1, 2, \ldots, r$, it follows that $|\int P^{k m_i}(x, dy) f_k(y) - \int f(y)\,\pi(dy)| \to 0$ for $i = 1, 2, \ldots, r$ if $f_k(y) \to f(y)$ for $[\pi]$-almost every $y$. Therefore, as $K \to \infty$,

\[ \sup_{C \in \mathcal{B}} |P^K(x, C) - \pi(C)| = \sup_{C \in \mathcal{B}} \Bigl| \int P^{(k+r) a_1 m_1}(x, dy_1) P^{(k+r) a_2 m_2}(y_1, dy_2) \cdots P^{(kM - r) b_r m_r}(y_{r-1}, C) - \pi(C) \Bigr| \to 0. \]

◊
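
The arithmetic behind the decomposition of $K$ can be made concrete. The function below is a minimal sketch, with hypothetical names and example values, of the identity $K = (k + r)N + (kM - r)M$ when $N - M = 1$.

```python
def decompose(K, N, M):
    """Write K = (k + r) * N + (k * M - r) * M, assuming N - M == 1 and M >= 1.
    Returns the two multipliers (k + r, k * M - r)."""
    assert N - M == 1 and M >= 1
    k, r = divmod(K, M * M + N)   # K = k (M^2 + N) + r with 0 <= r < M^2 + N
    return k + r, k * M - r

mult_N, mult_M = decompose(1000, 4, 3)   # here M^2 + N = 13, k = 76, r = 12
```

With these values `mult_N` is 88 and `mult_M` is 216, and indeed 88 * 4 + 216 * 3 = 1000. Both multipliers grow without bound as $K$ does, which is what allows the convergence along multiples of each $m_i$ to be combined.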
Remark 2 The proof above also establishes the following slight extension of Theorem 1.

Theorem 1' Suppose that $\pi$ is a stationary probability measure for a Markov chain with transition function $P(x, \cdot)$. Let $\mathcal{M}'$ be the set of all integers $m \ge 1$ such that there exist $\varepsilon_m > 0$, $A_m \in \mathcal{B}$, and a probability measure $\rho_m$ with $\rho_m(A_m) = 1$ such that

\[ \pi\{x : P_x(T(A_m) < \infty) > 0\} = 1, \]

\[ P^m(x, \cdot) \ge \varepsilon_m \rho_m(\cdot) \quad \text{for each } x \in A_m \]

and

\[ \sum_{k \ge 1} P^{km}(x, A_m) > 0 \quad \text{for } [\pi]\text{-almost every } x. \]

Then the conclusion (1.7) of Theorem 1 holds if g.c.d.$(\mathcal{M}') = 1$.


2.3  Proof  of  Theorem  2  for  General  Markov  Chains 

As mentioned earlier, the key to the proof of Theorem 2 is to recognize an embedded Markov chain which satisfies the conditions of Theorem 1. The proof of Theorem 2 is completed after Lemma 9.

Let $Y_m = X_{m n_0}$, $m = 0, 1, \ldots$ and put $Q(x, C) = P^{n_0}(x, C)$ for $x \in X$ and $C \in \mathcal{B}$. The subsequence $\{Y_0, Y_1, \ldots\}$ is a Markov chain with transition probability function $Q(x, C)$ and we will call it the embedded Markov chain. Define

\[ A_r = \Bigl\{ x : \sum_{m=1}^{\infty} P^{m n_0 - r}(x, A) > 0 \Bigr\}, \quad r = 0, 1, \ldots, n_0 - 1. \tag{2.30} \]

Since $P^{n_0}(x, A) \ge \varepsilon$ for all $x \in A$, one can also define $A_r$ by

\[ A_r = \Bigl\{ x : \sum_{m=k}^{\infty} P^{m n_0 - r}(x, A) > 0 \Bigr\} \quad \text{for any } k \ge 1, \]

i.e., $A_r$ is the set of all points from which $A$ is accessible at time points of the form $m n_0 - r$ for all large $m$, and $A_0$ is the set of all points from which $A$ is accessible in the embedded Markov chain.

Lemma 6 below shows that the embedded Markov chain satisfies the conditions of Theorem 1 with the restriction of $\pi$ to $A_0$ as its stationary probability measure.

Lemma 6 Under the conditions of Theorem 2,

\[ \pi(A_0) > 0. \]

Let

\[ \pi_0(C) = \frac{\pi(C \cap A_0)}{\pi(A_0)}. \]

The embedded Markov chain $\{Y_0, Y_1, \ldots\}$ satisfies the conditions (1.4) and (1.5) of Theorem 1 with $\pi_0$ as a stationary probability measure and with the $n_0$ appearing in (1.5) equal to 1.
Proof Condition (1.1) states that $\pi(\{x : P_x(T(A) < \infty) > 0\}) = 1$. Just the fact that this probability is positive and condition (1.4) allow us to use Corollary 1 to conclude that $\pi(A) > 0$. Condition (1.5) implies that $A \subset A_0$. Thus $\pi(A_0) > 0$ and hence $\pi_0$ is a well-defined probability measure. Clearly,

\[ \pi(C) = \int \pi(dx) Q(x, C) \quad \text{for all } C \in \mathcal{B}, \tag{2.31} \]

\[ \pi_0(A_0) = 1, \tag{2.32} \]

\[ Q(x, \cdot) \ge \varepsilon \rho(\cdot) \quad \text{for all } x \in A. \tag{2.33} \]

Notice that

\[ \sum_{2 \le m < \infty} Q^m(x, A) = \int_X Q(x, dy) \sum_{1 \le m < \infty} Q^m(y, A) = \int_{A_0} Q(x, dy) \sum_{1 \le m < \infty} Q^m(y, A). \]

Hence $Q(x, A_0) > 0$ implies that $\sum_{2 \le m < \infty} Q^m(x, A) > 0$, i.e., $x \in A_0$. In other words,

\[ x \notin A_0 \quad \text{implies that} \quad Q(x, A_0) = 0. \tag{2.34} \]

From (2.31) and (2.34) we have the equality

\[ \pi(A_0) = \int_X \pi(dx) Q(x, A_0) = \int_{A_0} \pi(dx) Q(x, A_0) \]

which implies that $Q(x, A_0) = 1$ for $[\pi]$-almost all $x \in A_0$. Hence

\[ \int_X \pi_0(dx) Q(x, C) = \frac{1}{\pi(A_0)} \int_{A_0} \pi(dx) Q(x, C) = \frac{1}{\pi(A_0)} \int_X \pi(dx) Q(x, C \cap A_0) = \pi_0(C). \tag{2.35} \]

Equations (2.35), (2.32) and (2.33) establish the lemma. ◊


Define

\[ \pi_r(C) = \int_{A_0} \pi_0(dx) P^r(x, C) \]

for $r = 1, 2, \ldots, n_0 - 1$ and

\[ \tilde{\pi}(C) = \frac{1}{n_0} \sum_{r=0}^{n_0 - 1} \pi_r(C). \]

Note that $\pi_r$ is the distribution of $X_r$ when $Y_0 = X_0$ has initial distribution $\pi_0$. The next lemma shows that averages of the transition functions of the embedded chain converge to $\tilde{\pi}(C)$ for $[\pi_0]$-almost all $x$.


Lemma 7 Define

\[ B_0 = \Bigl\{ x : x \in A_0,\ \sup_{C \in \mathcal{B}} |P^{m n_0}(x, C) - \pi_0(C)| \to 0 \text{ as } m \to \infty \Bigr\}. \tag{2.36} \]

Under the conditions of Theorem 2

\[ \pi_0(B_0) = 1. \tag{2.37} \]

Moreover for each $x \in B_0$,

\[ \sup_{C \in \mathcal{B}} |P^{m n_0 + r}(x, C) - \pi_r(C)| \to 0 \quad \text{as } m \to \infty \text{ for } r = 0, 1, \ldots, n_0 - 1, \tag{2.38} \]

and hence

\[ \sup_{C \in \mathcal{B}} \Bigl| \frac{1}{n_0} \sum_{r=0}^{n_0 - 1} P^{m n_0 + r}(x, C) - \tilde{\pi}(C) \Bigr| \to 0 \quad \text{as } m \to \infty. \tag{2.39} \]

Proof From Lemma 6 the embedded Markov chain satisfies the conditions of Theorem 1 with the $n_0$ appearing in (1.5) equal to 1. From Proposition 2 it follows that

\[ \sup_{C \in \mathcal{B}} |P^{m n_0}(x, C) - \pi_0(C)| \to 0 \quad \text{as } m \to \infty \]

for $[\pi_0]$-almost all $x$. This establishes (2.37). For $r = 0, 1, \ldots, n_0 - 1$ and $x \in B_0$,

\[ \begin{aligned}
\sup_{C \in \mathcal{B}} |P^{m n_0 + r}(x, C) - \pi_r(C)| &= \sup_{C \in \mathcal{B}} \Bigl| \int_X \bigl( P^{m n_0}(x, dy) - \pi_0(dy) \bigr) P^r(y, C) \Bigr| \\
&\le \sup_{D \in \mathcal{B}} |P^{m n_0}(x, D) - \pi_0(D)| \to 0
\end{aligned} \]

as $m \to \infty$, establishing (2.38). ◊
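
A periodic chain shows why (2.39) averages over a full cycle of length $n_0$. The sketch below uses a hypothetical 4-state chain of period 2 (the matrix and names are assumptions for illustration): the law at even times stays on half of the space, while the average over one cycle settles down to a stationary distribution.

```python
P = [[0.0, 0.0, 0.5, 0.5],
     [0.0, 0.0, 0.2, 0.8],
     [0.3, 0.7, 0.0, 0.0],
     [0.6, 0.4, 0.0, 0.0]]   # states {0, 1} and {2, 3} alternate, so n_0 = 2
N0 = 2

def step(row, P):
    """One transition applied to a row vector of probabilities."""
    return [sum(row[i] * P[i][j] for i in range(len(P))) for j in range(len(P))]

def tv(p, q):
    """sup over sets C of |p(C) - q(C)| for distributions on a finite set."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

row = [1.0, 0.0, 0.0, 0.0]
for _ in range(40):                  # row becomes P^{40}(0, .): m = 20, r = 0
    row = step(row, P)
row_next = step(row, P)              # P^{41}(0, .): m = 20, r = 1
cycle_average = [(a + b) / N0 for a, b in zip(row, row_next)]

gap_single = tv(row, cycle_average)                          # does not shrink
gap_stationary = tv(step(cycle_average, P), cycle_average)   # essentially zero
```

Here `gap_single` stays at 1/2 however large $m$ is, whereas `cycle_average` is stationary up to rounding.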

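The next lemma shows that the conclusions of the previous lemma hold $[\tilde{\pi}]$-almost everywhere.

Lemma 8 Under the conditions of Theorem 2,

\[ \pi_r(A_r) = 1 \quad \text{for } r = 1, \ldots, n_0 - 1 \]

and (2.39) holds for $[\tilde{\pi}]$-almost all $x$.

Proof Consider the original Markov chain $X_0, X_1, \ldots$. Let $E \in \mathcal{B}$ and let $\pi_0(E) = 1$. Then

\[ \begin{aligned}
\int_X \pi_1(dx) P^{n_0 - 1}(x, E) &= \int_{x \in X} \int_{y \in A_0} \pi_0(dy) P(y, dx) P^{n_0 - 1}(x, E) \\
&= \int_{A_0} \pi_0(dy) P^{n_0}(y, E) \\
&= \pi_0(E) = 1
\end{aligned} \]

and hence $P^{n_0 - 1}(x, E) = 1$ for $[\pi_1]$-almost all $x$. In particular we take $E = B_0$ and rewrite the conclusion as $\pi_1(B_1) = 1$ where the sets $B_r$ are defined by

\[ B_r = \{x : P^{n_0 - r}(x, B_0) = 1\}, \quad r = 1, 2, \ldots, n_0 - 1. \]

Let $x \in B_1$. Then

\[ \sum_{m \ge 2} P^{m n_0 - 1}(x, A) \ge \int_{B_0} P^{n_0 - 1}(x, dy) \sum_{m \ge 2} P^{(m-1) n_0}(y, A) > 0 \]

and hence $x \in A_1$. Thus $B_1 \subset A_1$. Similarly, $\pi_r(B_r) = 1$ and $B_r \subset A_r$ for all $r$. Notice that for $x \in B_1$

\[ \begin{aligned}
\sup_{C \in \mathcal{B}} |P^{m n_0 + r + n_0 - 1}(x, C) - \pi_r(C)| &\le \int_{y \in X} \sup_{C \in \mathcal{B}} |P^{m n_0 + r}(y, C) - \pi_r(C)|\, P^{n_0 - 1}(x, dy) \\
&= \int_{y \in B_0} \sup_{C \in \mathcal{B}} |P^{m n_0 + r}(y, C) - \pi_r(C)|\, P^{n_0 - 1}(x, dy).
\end{aligned} \]

From (2.38) it follows that

\[ \sup_{C \in \mathcal{B}} |P^{m n_0 + r + n_0 - 1}(x, C) - \pi_r(C)| \to 0 \quad \text{for } [\pi_1]\text{-almost all } x \text{ as } m \to \infty. \tag{2.40} \]

As a consequence,

\[ \sup_{C \in \mathcal{B}} \Bigl| \frac{1}{n_0} \sum_{r=0}^{n_0 - 1} P^{m n_0 + r + n_0 - 1}(x, C) - \tilde{\pi}(C) \Bigr| \to 0 \quad \text{for } [\pi_1]\text{-almost all } x \text{ as } m \to \infty. \tag{2.41} \]

Now

\[ \begin{aligned}
\sup_{C \in \mathcal{B}} \Bigl| \frac{1}{n_0} \sum_{r=0}^{n_0 - 1} P^{(m+1) n_0 + r}(x, C) - \tilde{\pi}(C) \Bigr| &\le \sup_{C \in \mathcal{B}} \Bigl| \frac{1}{n_0} \sum_{r=0}^{n_0 - 1} P^{m n_0 + r + n_0 - 1}(x, C) - \tilde{\pi}(C) \Bigr| \\
&\quad + \frac{1}{n_0} \sup_{C \in \mathcal{B}} |P^{(m+1) n_0 + n_0 - 1}(x, C) - P^{m n_0 + n_0 - 1}(x, C)|
\end{aligned} \]

and as $m \to \infty$ this converges to 0 for $[\pi_1]$-almost all $x$, from (2.40) and (2.41). A similar argument shows that (2.39) holds for $[\pi_r]$-almost all $x$ and all $r$ and hence for $[\tilde{\pi}]$-almost all $x$. ◊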


We now establish that $\tilde{\pi} = \pi$ by using the full force of condition (1.4).

Lemma 9 Under the conditions of Theorem 2, $\pi_r$ is the restriction of $\pi$ to $A_r$, for $r = 1, 2, \ldots, n_0 - 1$ and

\[ \pi = \tilde{\pi}. \]

Proof We have already shown that $\pi_r(A_r) = 1$, $r = 1, 2, \ldots, n_0 - 1$. We will now show that $A_0, \ldots, A_{n_0 - 1}$ act like cyclically moving subsets in the sense that

\[ x \in A_0^c \quad \text{implies that} \quad P(x, A_1) = 0. \]

Suppose that $P(x, A_1) > 0$. Then

\[ \sum_{m \ge 1} P^{m n_0}(x, A) \ge \int_{A_1} P(x, dy) \sum_{m \ge 1} P^{m n_0 - 1}(y, A) > 0, \]

which implies that $x \in A_0$. Thus $x \in A_0^c$ implies that $P(x, A_1) = 0$. Now for $C \in \mathcal{B}$,

\[ \begin{aligned}
\pi_1(C) &= \pi_1(C \cap A_1) \\
&= \frac{1}{\pi(A_0)} \int_{A_0} \pi(dx) P(x, C \cap A_1) \\
&= \frac{1}{\pi(A_0)} \int_X \pi(dx) P(x, C \cap A_1) \\
&= \frac{\pi(C \cap A_1)}{\pi(A_0)}.
\end{aligned} \]

Since $\pi_1(A_1) = 1$, this implies that $\pi(A_1) = \pi(A_0)$ and that $\pi_1$ is the restriction of $\pi$ to $A_1$. A similar conclusion holds for $\pi_r$ for other values of $r$.

We now use the full force of Condition (1.4), which can be restated as $\pi(\bigcup_{r=0}^{n_0 - 1} A_r) = 1$. This together with the fact that $\pi_r$ is the restriction of $\pi$ to $A_r$, $r = 0, 1, \ldots, n_0 - 1$, implies that the probability measures $\pi$ and $\tilde{\pi}$ are absolutely continuous with respect to each other. From this observation and Lemma 8, for any $C \in \mathcal{B}$,

\[ H_{m n_0}(x, C) = \frac{1}{m n_0} \sum_{j=0}^{m n_0 - 1} P^j(x, C) \to \tilde{\pi}(C) \]

for $[\pi]$-almost all $x$. Now,

\[ \pi(C) = \int_X \pi(dx) H_{m n_0}(x, C) \to \int_X \pi(dx) \tilde{\pi}(C) = \tilde{\pi}(C). \]

This shows that $\pi = \tilde{\pi}$. ◊

We now complete the proof of Theorem 2.

Proof of Theorem 2 It is clear that Lemmas 8 and 9 establish conclusions (1.8) and (1.9) of Theorem 2. Let $f(x)$ be a measurable function satisfying $\int |f(x)|\,\pi(dx) < \infty$. From a slight extension of Corollary 2 as applied to the embedded Markov chain for the averages of $f(\cdot)$ over the whole chain, we obtain

\[ \pi_0(B_f) = 1 \]

where

\[ B_f = \Bigl\{ x : P_x\Bigl\{ \frac{1}{n} \sum_{j=1}^{n} f(X_j) \to \int f(x)\,\pi(dx) \text{ as } n \to \infty \Bigr\} = 1 \Bigr\}. \]

From the argument at the beginning of the proof of Lemma 8 we have $P^{n_0 - 1}(x, B_f) = 1$ for $[\pi_1]$-almost all $x$. The definition of $B_f$ is such that if $P^{n_0 - 1}(x, B_f) = 1$ then $x \in B_f$. Hence $\pi_1(B_f) = 1$, and similarly $\pi_r(B_f) = 1$ for $r = 2, 3, \ldots$. This together with the fact that $\tilde{\pi} = \pi$ establishes (1.10). Conclusion (1.11) follows from (1.10) and the uniform integrability of $\frac{1}{n} \sum_{j=1}^{n} f(X_j)$ under $P_x$ for $[\pi]$-almost all $x$. ◊

2.4  Rates  of  Convergence  and  Remarks 

The proof of the convergence of $P^n(x, \cdot)$ to $\pi(\cdot)$ rested mainly on Parts (b) and (c) of Lemma 3. We can translate the equivalence of (2.14) and (2.15) stated in Remark 1 following the proof of Lemma 3 into results on geometric convergence in the ergodic theorem for Markov chains. We will state such a result and give a brief proof.

Theorem 6 Suppose that the conditions of Proposition 2 hold and there is a $t_0 > 0$ such that

\[ \sum_{n=1}^{\infty} \exp(n t_0) \int |P^n(x, A) - P^{n+1}(x, A)|\,\rho(dx) < \infty. \tag{2.42} \]

Then there is a set $D_0$ with $\pi(D_0) = 1$, such that for each $x \in D_0$, there is a $\beta$ with $\exp(-t_0) < \beta < 1$ and a $K < \infty$ such that

\[ \sup_{C \in \mathcal{B}} |P^n(x, C) - \pi(C)| \le K \beta^n. \]

The constants $\beta$ and $K$ can depend on $x$.

We remark that if in Proposition 2 the set $A$ can be taken to be the whole space $X$, then the series (2.42) converges automatically. In this case, Theorem 6 asserts geometric convergence, which is qualitatively the same result as that given by Theorem 3 (it does not give the explicit constant given in Theorem 3, however).

To prove Theorem 6, we will need the following lemma.

Lemma 10 For each positive integer $n$,

\[ \bar{P}^n(\Delta, C) = \int_{x, y \in X} \rho(dx) P^{n-1}(x, dy) \bar{P}(y, C) \quad \text{for } C \in \bar{\mathcal{B}} \tag{2.43} \]

and

\[ \bar{P}^n(\Delta, \Delta) = \varepsilon^* \int_{x \in X} \rho(dx) P^{n-1}(x, A). \tag{2.44} \]

The lemma is proved by induction on $n$. For $n = 1$, (2.43) is the same as definition (2.21) of $\bar{P}(x, C)$. The induction step is carried out by direct calculation. Equation (2.44) follows from (2.43) and (2.21).

Proof of Theorem 6 Use the construction of the Markov chain on the enlarged space $\bar{X}$ as in the proof of Proposition 2. Let $f^n(\Delta, \Delta)$ be the probability that the Markov chain $X_n$ starting at $\Delta$ reaches $\Delta$ for the first time at time $n$. Identify $p_n$ in Lemma 3 and Remark 1 with $f^n(\Delta, \Delta)$. This is what was done in the proof of Proposition 2 in an indirect fashion while appealing to Proposition 1. It is easy to see that the $r_n$ appearing in Lemma 3 and Remark 1 is $\bar{P}^n(\Delta, \Delta)$. From (2.44), Condition (2.14) reduces to Condition (2.42). Theorem 6 now follows from Remark 1 on rates of convergence. ◊
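
Geometric convergence of the kind asserted in Theorem 6 is easy to observe on a finite chain, where the supremum over sets equals half the $\ell_1$ distance. The sketch below uses a hypothetical 3-state chain for which $A$ can be taken to be the whole space; the constants `K` and `beta` are illustrative values that work for this chain only and are not derived from the theorem.

```python
P = [[0.5, 0.3, 0.2],
     [0.2, 0.5, 0.3],
     [0.3, 0.2, 0.5]]        # doubly stochastic, so pi is uniform
pi = [1 / 3, 1 / 3, 1 / 3]

def step(row, P):
    """One transition applied to a row vector of probabilities."""
    return [sum(row[i] * P[i][j] for i in range(len(P))) for j in range(len(P))]

def tv(p, q):
    """sup over sets C of |p(C) - q(C)| for distributions on a finite set."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

row = [1.0, 0.0, 0.0]        # chain started at state 0
tv_seq = []
for n in range(21):
    tv_seq.append(tv(row, pi))   # distance between P^n(0, .) and pi
    row = step(row, P)

K, beta = 1.0, 0.3           # illustrative constants for this particular chain
bounded = all(tv_seq[n] <= K * beta ** n for n in range(21))
```

For this chain the distances shrink by a factor of roughly 0.26 per step, so the bound with `beta = 0.3` holds at every $n$ checked.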

Remark 3 In Section 1, we described how to form a transition function from the two conditional distributions $\pi_{X_1|X_2}$ and $\pi_{X_2|X_1}$ obtained from a bivariate distribution $\pi$. We mentioned that for a Markov chain with such a transition function to converge in distribution to $\pi$ it is necessary that $\pi_{X_1|X_2}$ and $\pi_{X_2|X_1}$ determine $\pi$. Several researchers have considered the question of when the conditional distributions determine the joint distribution. Besag (1974) noted that uniqueness is guaranteed if the distributions are discrete and the support of $\pi$ is a permutation invariant set. Theorem 1 gives a sufficient condition for uniqueness in the general case.

One can give a simple nondegenerate example to show that in general, the two conditional distributions do not determine the joint distribution. Let $X_1$ have a density function $p(x)$ such that

\[ \sum_{-\infty < m < \infty} p(m + r) = c_r < \infty \quad \text{for each } r \in [0, 1). \]

The density function $p(x) = \frac{1}{2} \exp(-|x|)$, for instance, satisfies this condition. Let $\pi_{X_2|X_1}$ be the distribution that puts masses

\[ \tfrac{1}{2} \text{ at } x_1 + 1 \quad \text{and} \quad \tfrac{1}{2} \text{ at } x_1 - 1. \tag{2.45} \]

This determines the other conditional distribution $\pi_{X_1|X_2}$. This puts masses

\[ \frac{p(x_2 + 1)}{p(x_2 + 1) + p(x_2 - 1)} \text{ at } x_2 + 1 \quad \text{and} \quad \frac{p(x_2 - 1)}{p(x_2 + 1) + p(x_2 - 1)} \text{ at } x_2 - 1. \tag{2.46} \]

It can be seen that the two conditional distributions (2.46) and (2.45) do not uniquely determine a joint distribution for $(X_1, X_2)$. Fix $r \in [0, 1)$ and consider the discrete distribution $p_r$ on the points $m + r$, $m = \ldots, -1, 0, 1, \ldots$ defined by $p_r(m + r) = \frac{1}{c_r} p(m + r)$. Let $Y_1(r)$ be distributed according to $p_r$, and let the conditional distribution of $Y_2(r)$ given $Y_1(r)$ be the distribution defined in (2.45). It is easy to see that the distribution of $Y_1(r)$ given $Y_2(r)$ is that given in (2.46), and the joint distribution of $(Y_1(r), Y_2(r))$ has the same conditional distributions as $(X_1, X_2)$.
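
The claim that the lattice distributions $p_r$ reproduce the conditional (2.46) can be checked numerically. The sketch below is an illustration under stated assumptions: it uses the density $p(x) = \frac{1}{2}\exp(-|x|)$ from the text, truncates the lattice to $|m| \le 60$ (an approximation introduced here), and uses hypothetical function names.

```python
import math

def p(x):
    """The example density p(x) = exp(-|x|) / 2."""
    return 0.5 * math.exp(-abs(x))

def lattice_marginal(r, m_range=60):
    """Weights p_r(m + r), indexed by m; truncated to |m| <= m_range."""
    weights = {m: p(m + r) for m in range(-m_range, m_range + 1)}
    c_r = sum(weights.values())
    return {m: w / c_r for m, w in weights.items()}

def cond_up(r, m2):
    """P(Y1 = y2 + 1 | Y2 = y2) for y2 = m2 + r, when Y1 ~ p_r and
    Y2 = Y1 + 1 or Y1 - 1 with probability 1/2 each, as in (2.45)."""
    marg = lattice_marginal(r)
    up = 0.5 * marg[m2 + 1]      # Y1 = y2 + 1 followed by a step down
    down = 0.5 * marg[m2 - 1]    # Y1 = y2 - 1 followed by a step up
    return up / (up + down)

def formula_246(y2):
    """Mass placed at y2 + 1 by the conditional distribution (2.46)."""
    return p(y2 + 1) / (p(y2 + 1) + p(y2 - 1))

gaps = [abs(cond_up(r, m2) - formula_246(m2 + r))
        for r in (0.25, 0.5) for m2 in (-2, 0, 3)]
```

The normalizing constant $c_r$ cancels in the ratio, so every entry of `gaps` is at rounding level even though the joint distributions for different $r$ live on disjoint lattices.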

It is even possible to find joint distributions with continuous marginals for which the conditionals are given by (2.46) and (2.45). Let $f(r)$ be any probability density on $[0, 1)$. Let $R$ have density function $f(r)$ and put $(Z_1, Z_2) = (Y_1(R), Y_2(R))$. Clearly the conditional distributions of $(Z_1, Z_2)$ are as in (2.46) and (2.45). The marginal distribution function of $Z_1$ is given by

\[ P(Z_1 \le x) = \int_{[0,1)} \Bigl( \sum_{m:\, m + r \le x} p_r(m + r) \Bigr) f(r)\,dr = \int_{-\infty}^{x} \frac{p(y)}{c_{y - [y]}} f(y - [y])\,dy. \]

A similar expression can be written down for the distribution function of $Z_2$. Notice that $Z_1$ and $Z_2$ have density functions.

3  Remarks  on  the  Sampling  Plan 

In Section 1 we mentioned that there are a number of ways of using the Markov chain to
estimate $\pi$ or some aspect of $\pi$. One can generate $G$ independent chains, each of
length $n$, and retain the last observation from each chain, obtaining a sample of $G$
independent variables. At another extreme, one can generate a very long sample
$X_0, X_1, X_2, \ldots, X_{nG}$, and use $X_n, X_{2n}, \ldots, X_{Gn}$, which form a nearly
i.i.d. sequence from $\pi$. This is at approximately the same cost in CPU time. (Clearly
intermediate solutions are possible.) If the objective is to estimate an expectation
$\int f(x)\,\pi(dx)$, then there is no reason to discard the intermediate values from a long
chain, and one can use

\[
\frac{1}{nG} \sum_{i=1}^{nG} f(X_i). \tag{3.1}
\]

The almost sure convergence of (3.1) follows from Theorem 2 under the assumption
$\int f(x)\,\pi(dx) < \infty$ (note that we do not need the aperiodicity condition (1.6)).
Thus, from the point of view of estimating a particular expectation $\int f(x)\,\pi(dx)$ or
probability, it is clear that the optimal way of using the Markov chain is to use (3.1), and so
it is natural to ask why one should bother to prove results such as (1.7). In the Bayesian
framework, there is another aspect that must be considered, which is that generally, in the
exploratory stage, one is interested in calculating posterior distributions and densities for a
large number of prior distributions. It will usually not be feasible to run a separate Markov
chain for each prior of interest (the time needed is on the order of several minutes for each
prior). Instead, one will want to get a sequence of random variables $X_1, \ldots, X_r$
distributed according to the posterior distribution with respect to some fixed prior, and then
use that same sequence to estimate the posterior with respect to many other priors. (We
discuss how this may be done in the next paragraph.) The important point here is that if there
are a large number of priors involved, then the manipulations of the sequence
$X_1, \ldots, X_r$ to produce the posterior for each prior must be done very quickly. This
restricts the size of $r$, and so one will generally want the sequence $X_1, \ldots, X_r$ to be
independent. This precludes running a very long chain and taking sample averages as in (3.1).
Instead, one will want to generate independent chains and retain the last random variable in
each chain, or take a long chain and retain only random variables at equally spaced intervals.
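
As an illustration only, the three sampling plans discussed above can be sketched in Python on a
toy chain. The Gaussian autoregressive chain, the constant RHO, and all function names are
assumptions introduced here for the example and are not taken from the paper. The chain has
stationary distribution N(0, 1), so the exact value of $\int x^2\,\pi(dx)$ is 1.

```python
import numpy as np

RHO = 0.9  # autocorrelation of the toy chain (assumed value, for illustration only)

def run_chain(x0, length, rng):
    """Run X_{t+1} = RHO*X_t + sqrt(1 - RHO^2)*eps and return X_1, ..., X_length.

    The stationary distribution of this chain is N(0, 1).
    """
    xs = np.empty(length)
    x = x0
    scale = np.sqrt(1.0 - RHO**2)
    for t in range(length):
        x = RHO * x + scale * rng.standard_normal()
        xs[t] = x
    return xs

def last_of_independent_chains(G, n, rng, x0=0.0):
    """Plan 1: G independent chains of length n, keeping only X_n from each."""
    return np.array([run_chain(x0, n, rng)[-1] for _ in range(G)])

def thinned_long_chain(G, n, rng, x0=0.0):
    """Plan 2: one chain of length n*G, keeping X_n, X_{2n}, ..., X_{Gn}."""
    xs = run_chain(x0, n * G, rng)
    return xs[n - 1 :: n]  # xs[0] is X_1, so X_n sits at index n - 1

def ergodic_average(f, G, n, rng, x0=0.0):
    """Plan 3: the average (3.1) of f over all n*G values of one long chain."""
    xs = run_chain(x0, n * G, rng)
    return float(np.mean(f(xs)))
```

Plans 1 and 2 return samples of size G that can be stored and reused, which is what the
reweighting described next requires; plan 3 returns a single number for one function f.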

We now discuss in more detail how one might use one sequence $X_1, \ldots, X_r$ to calculate
posteriors with respect to many priors. We depart from the notation of the paper and switch
to the notation usually used in Bayesian analysis. Suppose that $\{\nu_h\}$ is a family of
priors for the parameter $\theta$. Here, $h$ lies in some interval and we think of it as a
hyperparameter for the prior. Suppose that we are in the dominated case, i.e. there is a
likelihood function $l_X(\theta)$, where $X$ now represents the data.

Let $\nu_{h,X}$ be the posterior distribution of $\theta$ when the prior is $\nu_h$. We know
that $\nu_{h,X}$ is dominated by $\nu_h$ and
\[
\frac{d\nu_{h,X}}{d\nu_h}(\theta) = c_h(X)\, l_X(\theta),
\]
where $c_h(X)$ is a normalizing constant.

Consider the case where we can generate observations $\theta_1, \theta_2, \ldots, \theta_r$
from $\nu_{0,X}$ and therefore estimate $\int f(\theta)\,d\nu_{0,X}(\theta)$ by
$(1/r) \sum_{i=1}^{r} f(\theta_i)$. We will indicate now how we can obtain estimates of
$\int f(\theta)\,d\nu_{h,X}(\theta)$ for $h \ne 0$.

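Suppose that $\nu_h$ is dominated by $\nu_0$. Then it is clear that $\nu_{h,X}$ is dominated
by $\nu_{0,X}$ and
\[
\frac{d\nu_{h,X}}{d\nu_{0,X}}(\theta) = \frac{c_h(X)}{c_0(X)}\, \frac{d\nu_h}{d\nu_0}(\theta),
\]
since the likelihood $l_X(\theta)$ cancels. We may write
\[
\int f(\theta)\, d\nu_{h,X}(\theta)
  = \int f(\theta)\, \frac{d\nu_{h,X}}{d\nu_{0,X}}(\theta)\, d\nu_{0,X}(\theta)
  = \frac{c_h(X)}{c_0(X)} \int f(\theta)\, \frac{d\nu_h}{d\nu_0}(\theta)\, d\nu_{0,X}(\theta).
\]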

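Substituting $f(\theta) = 1$ in the above we can obtain the constant and write
\[
\int f(\theta)\, d\nu_{h,X}(\theta)
  = \frac{\int f(\theta)\, \frac{d\nu_h}{d\nu_0}(\theta)\, d\nu_{0,X}(\theta)}
         {\int \frac{d\nu_h}{d\nu_0}(\theta)\, d\nu_{0,X}(\theta)}.
\]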
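
For completeness, the substitution step can be written out as follows. This is a routine check
that uses only the fact that $\nu_{h,X}$ is a probability measure, so that its total mass is 1:
\[
1 = \int d\nu_{h,X}(\theta)
  = \frac{c_h(X)}{c_0(X)} \int \frac{d\nu_h}{d\nu_0}(\theta)\, d\nu_{0,X}(\theta)
\quad \Longrightarrow \quad
\frac{c_h(X)}{c_0(X)}
  = \Bigl( \int \frac{d\nu_h}{d\nu_0}(\theta)\, d\nu_{0,X}(\theta) \Bigr)^{-1}.
\]
The unknown ratio of normalizing constants is therefore itself an expectation under
$\nu_{0,X}$, and it can be estimated from the same draws.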


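Thus, we may estimate $\int f(\theta)\,d\nu_{h,X}(\theta)$ by
\[
\sum_{i=1}^{r} f(\theta_i)\, w_{h,i},
\qquad \text{where} \qquad
w_{h,i} = \frac{\frac{d\nu_h}{d\nu_0}(\theta_i)}{\sum_{j=1}^{r} \frac{d\nu_h}{d\nu_0}(\theta_j)}.
\]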
This is the well-known "ratio estimate" in importance sampling theory. The key point is that
its calculation requires knowledge of the density ratio only up to a multiplicative constant.
See Hastings (1970).
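
A minimal Python sketch of the ratio estimate follows. The conjugate normal model is an
assumption chosen only because the exact answer is known: the prior $\nu_h$ is taken to be
N(h, 1) and the data satisfy $X \mid \theta \sim$ N($\theta$, 1), so the posterior under
$\nu_h$ is N((h + X)/2, 1/2) and $\log (d\nu_h/d\nu_0)(\theta) = h\theta - h^2/2$. The function
names are hypothetical, and independent draws stand in for the output of the Markov chain.

```python
import numpy as np

def ratio_estimate(f_vals, log_prior_ratio):
    """Return sum_i w_i * f(theta_i) with w_i proportional to (dnu_h/dnu_0)(theta_i).

    The ratio may be supplied up to an additive constant on the log scale,
    because the weights are normalized to sum to one.
    """
    log_w = log_prior_ratio - np.max(log_prior_ratio)  # guard against overflow in exp
    w = np.exp(log_w)
    w /= w.sum()
    return float(np.sum(w * f_vals))

def log_prior_ratio(theta, h):
    """log of N(theta; h, 1) / N(theta; 0, 1) for the assumed toy priors."""
    return h * theta - 0.5 * h**2

rng = np.random.default_rng(1)
x_obs = 1.0   # assumed observed data value
r = 20000     # number of draws from the posterior under the base prior nu_0

# Draws from nu_{0,X} = N(x_obs / 2, 1/2).
theta = rng.normal(x_obs / 2.0, np.sqrt(0.5), size=r)

h = 0.5
estimate = ratio_estimate(theta, log_prior_ratio(theta, h))  # posterior mean under nu_h
exact = (h + x_obs) / 2.0
```

Changing h only requires recomputing the r weights; the draws `theta` and the likelihood are
never touched again.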

Now in some Bayesian problems, for instance problems with missing or censored data,
the likelihood function $l_X(\theta)$ is either extremely difficult or impossible to calculate.
(An example of this arises in Doss (1991).) The fact that this likelihood cancels means that
the estimation of the expectation under the prior $\nu_h$ requires only the recomputation of
$r$ weights, and this can be done very fast.

It will often be the case that we wish to consider not just one function, but rather
a family of functions. As a simple example, if we wish to estimate the entire posterior
distribution of $\theta$, then in effect we wish to consider $f_t(\theta) = I(\theta \le t)$ for
a fine grid of values of $t$. On a Sparcstation 1, for $r = 50$ we have been able to do the
computations fast enough to dynamically display the estimates of the posterior distributions
$\int I(\theta \le t)\,d\nu_{h,X}(\theta)$ as $h$ varies, using the program Lisp-Stat described
in Tierney (1991). For larger values of $r$ it was necessary to precompute these estimates.
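
One way to organize the grid computation is sketched below in Python. Sorting the draws once
and accumulating the weights gives the estimate at every grid point $t$ in a single pass. The
toy model (prior N(h, 1), data $X \mid \theta \sim$ N($\theta$, 1)), the grid, and the function
name are assumptions made for illustration; this is not the Lisp-Stat program mentioned above.

```python
import math
import numpy as np

def weighted_cdf(theta, log_ratio, t_grid):
    """Estimate the posterior CDF under nu_h at each t in t_grid.

    theta     : draws from the posterior under the base prior nu_0
    log_ratio : log (dnu_h/dnu_0)(theta_i), known up to an additive constant
    """
    w = np.exp(log_ratio - np.max(log_ratio))
    w /= w.sum()
    order = np.argsort(theta)
    # cum_w[k] is the total weight of the k smallest draws (cum_w[0] = 0).
    cum_w = np.concatenate(([0.0], np.cumsum(w[order])))
    # counts[j] is the number of draws with theta_i <= t_grid[j].
    counts = np.searchsorted(theta[order], t_grid, side="right")
    return cum_w[counts]

rng = np.random.default_rng(2)
x_obs = 1.0  # assumed observed data value
theta = rng.normal(x_obs / 2.0, math.sqrt(0.5), size=5000)  # draws from nu_{0,X}
t_grid = np.linspace(-2.0, 3.0, 101)

# One estimated posterior CDF per hyperparameter value h; only the weights change.
cdf_by_h = {
    h: weighted_cdf(theta, h * theta - 0.5 * h**2, t_grid)
    for h in (-0.5, 0.0, 0.5)
}
```

Each new value of h costs one pass over the r weights, which is the property that makes an
interactive display feasible when r is small.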